The Hellinger Distance Between Two Probability Distributions Using Python

A fairly common sub-problem when working with machine learning algorithms is to compute the distance between two probability distributions. For example, suppose distribution P = (0.36, 0.48, 0.16) and Q = (1/3, 1/3, 1/3). How far apart are P and Q?

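There are many ways to calculate a distance between two probability distributions. One of the common metrics is called the Hellinger distance. For discrete distributions P = (p_1, ..., p_n) and Q = (q_1, ..., q_n), the Wikipedia article on the topic gives it as H(P,Q) = (1/sqrt(2)) * sqrt( sum_i (sqrt(p_i) - sqrt(q_i))^2 ).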

As far as I can tell, there is no Hellinger distance function in any of the standard Python libraries, scipy and numpy in particular. But it’s easy to write an H(P,Q) function using Python.
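To give a sense of how small such a function can be, here is a minimal sketch that uses only the standard library's math module. The name hellinger and the plain-list inputs are illustrative choices rather than anything from a library, and the sketch assumes both inputs have the same length and each sums to 1.

```python
import math

def hellinger(p, q):
  # hypothetical helper: Hellinger distance between two discrete
  # distributions given as equal-length sequences of probabilities
  total = 0.0
  for pi, qi in zip(p, q):
    total += (math.sqrt(pi) - math.sqrt(qi)) ** 2
  return math.sqrt(total) / math.sqrt(2.0)

p = [0.36, 0.48, 0.16]
q = [1.0/3.0, 1.0/3.0, 1.0/3.0]
print("H(P,Q) = %0.4f" % hellinger(p, q))  # approximately 0.1505
print("H(Q,P) = %0.4f" % hellinger(q, p))  # same value; H is symmetric
```

The 1/sqrt(2) factor scales the result so that it always lies between 0 (identical distributions) and 1 (distributions with no overlap).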

The Hellinger distance is quite well known, but for reasons which aren’t really clear to me, the Kullback-Leibler divergence is used far more often. KL is a divergence rather than a distance because, in general, KL(P,Q) != KL(Q,P).
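The asymmetry of KL is easy to see numerically. The sketch below assumes the usual discrete definition KL(P,Q) = sum_i p_i * ln(p_i / q_i) with natural logarithms; the function name kl is just an illustrative choice, and the code assumes every probability is strictly positive.

```python
import math

def kl(p, q):
  # hypothetical helper: KL divergence sum(p_i * ln(p_i / q_i));
  # assumes every p_i and q_i is strictly positive
  return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.36, 0.48, 0.16]
q = [1.0/3.0, 1.0/3.0, 1.0/3.0]
print("KL(P,Q) = %0.4f" % kl(p, q))  # approximately 0.0853
print("KL(Q,P) = %0.4f" % kl(q, p))  # approximately 0.0975
```

Swapping the arguments changes the result, whereas the Hellinger distance gives the same value in both directions.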

Both Hellinger (strictly speaking, its square) and KL are examples of a class of closely related mathematical ideas called f-divergences. Others include Pearson’s chi-square divergence, Neyman chi-square divergence, alpha-divergence, and Jensen-Shannon divergence.
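To make the f-divergence connection concrete, here is a rough sketch assuming the common discrete form D_f(P,Q) = sum_i q_i * f(p_i / q_i). Under that assumption, f(t) = t * ln(t) recovers KL(P,Q), and f(t) = (sqrt(t) - 1)^2 gives twice the squared Hellinger distance when H includes the 1/sqrt(2) factor. All function names here are hypothetical, and other texts normalize the generator functions differently.

```python
import math

def f_divergence(p, q, f):
  # generic form: sum over i of q_i * f(p_i / q_i);
  # assumes every p_i and q_i is strictly positive
  return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def f_kl(t):
  # generator that yields KL(P,Q)
  return t * math.log(t)

def f_hellinger(t):
  # generator that yields 2 * H(P,Q)^2
  return (math.sqrt(t) - 1.0) ** 2

p = [0.36, 0.48, 0.16]
q = [1.0/3.0, 1.0/3.0, 1.0/3.0]

d_kl = f_divergence(p, q, f_kl)
d_h = f_divergence(p, q, f_hellinger)
h = math.sqrt(d_h / 2.0)  # undo the factor of 2 and the square

print("KL(P,Q) via f-divergence = %0.4f" % d_kl)  # approximately 0.0853
print("H(P,Q) via f-divergence  = %0.4f" % h)     # approximately 0.1505
```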

J.C. Leyendecker was a very famous illustrator in the 1920s and 1930s. I usually post artwork I like; I’m not really a fan of Leyendecker’s work but the artistic distance between his work and work I like is small.

Left: Leyendecker gained fame for his artwork in a series of ads for Arrow Shirts. Center: A cover for the September 1934 issue of “The Saturday Evening Post” magazine. Leyendecker did 322 SEP covers from 1908 to 1946. Artist Norman Rockwell was also famous for his 321 SEP covers from 1916 to 1963. The math distance between the number of SEP covers by Leyendecker and Rockwell is 1. Right: Artwork for an ad for Kuppenheimer, a men’s clothing company that operated from 1852 to 1997.

# hellinger_distance_demo.py

import numpy as np

def H(p, q):
  # Hellinger distance between p and q
  # p and q are np array probability distributions of equal length
  n = len(p)
  total = 0.0  # named total to avoid shadowing the built-in sum()
  for i in range(n):
    total += (np.sqrt(p[i]) - np.sqrt(q[i]))**2
  result = (1.0 / np.sqrt(2.0)) * np.sqrt(total)
  return result

def main():
  print("\nBegin Hellinger distance from scratch demo ")
  np.set_printoptions(precision=4, suppress=True)

  p = np.array([9.0/25.0, 12.0/25.0, 4.0/25.0], dtype=np.float32)
  q = np.array([1.0/3.0, 1.0/3.0, 1.0/3.0], dtype=np.float32)

  print("\nThe P distribution is: ")
  print(p)
  print("\nThe Q distribution is: ")
  print(q)

  h_pq = H(p, q)
  h_qp = H(q, p)

  print("\nH(P,Q) = %0.6f " % h_pq)
  print("H(Q,P) = %0.6f " % h_qp)

  print("\nEnd demo ")

if __name__ == "__main__":
  main()