Wasserstein Distance From Scratch Using Python

The Wasserstein distance (also known as Earth Mover's Distance, or EMD) is a measure of the distance between two frequency or probability distributions. Wasserstein distance is often used to measure the difference between two images, and it is also often used in Generative Adversarial Networks (GANs) to compute error/loss for training.

There are several different versions of Wasserstein distance. There are versions that work with discrete data and versions for mathematically continuous data. There are versions for 1D data (each point is a single value) and versions for n-dimensional data (each point has n values). And there are “p versions,” denoted 1-Wasserstein (aka W1, for p = 1), 2-Wasserstein (aka W2, for p = 2), and so on. Unless otherwise stated the default is p = 1 (1-Wasserstein, W1), which uses the absolute values of differences. For p = 2 you take the square root of summed squared differences; for p = 3 the cube root of summed cubed differences, and so on.
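To make the p-versions concrete: for two 1D samples with the same number of points, the p-Wasserstein distance reduces to sorting both samples and taking the p-norm average of the pairwise gaps. This is just a quick illustration (the function name wasserstein_p is mine), not the implementation discussed later in this post:

```python
import numpy as np

def wasserstein_p(x, y, p=1):
  # p-Wasserstein between two equal-size 1D samples:
  # sort both, then take the p-norm mean of the gaps
  xs = np.sort(np.asarray(x, dtype=np.float64))
  ys = np.sort(np.asarray(y, dtype=np.float64))
  return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

x = [1.0, 3.0, 5.0]
y = [2.0, 4.0, 8.0]
print(wasserstein_p(x, y, p=1))  # 1.6667 -- mean of 1, 1, 3
print(wasserstein_p(x, y, p=2))  # 1.9149 -- sqrt of mean of 1, 1, 9
```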

When I write “Wasserstein distance from scratch”, this isn’t really correct. I really mean Wasserstein distance for p = 1, for 1D discrete data, where the data is a probability distribution (the values sum to 1.0). The version I implement is more accurately described as 1D discrete probability distribution information transfer distance — but that’s a bit too wordy.

Briefly, if one of the distributions is called “dirt” and the other distribution is called “holes”, the Wasserstein distance is the minimum amount of work required to move the dirt into the holes. Put another way, Wasserstein is the effort required to transform one distribution into another.

Some time ago I explored exactly how the Wasserstein distance is calculated by doing an example by hand. I verified my understanding by running my example problem through the scipy library wasserstein_distance() function.
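For reference, the scipy call for the example used later in this post looks like this. The bin positions go in as values and the two distributions go in as weights:

```python
import numpy as np
from scipy.stats import wasserstein_distance

dirt  = np.array([0.2, 0.1, 0.0, 0.0, 0.3, 0.0, 0.4])
holes = np.array([0.0, 0.5, 0.3, 0.0, 0.2, 0.0, 0.0])
bins = np.arange(7)  # positions 0 through 6

wd = wasserstein_distance(bins, bins, dirt, holes)
print("%0.2f" % wd)  # 2.20
```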

But in the back of my mind, I wasn’t completely satisfied with my knowledge because the scipy function was a black box to me. The idea percolated in the back of my mind until it bubbled up and nagged me to the point where I decided to implement a Wasserstein distance function from scratch.

After a bit of work, I succeeded, but with a small asterisk — my implementation, like the scipy version, only works for 1-dimensional discrete distributions. This means each data point is a single probability value, like 0.3, as opposed to a multi-valued item like (0.3, 0.2, 0.6).

Suppose dirt = [0.2, 0.1, 0.0, 0.0, 0.3, 0.0, 0.4] and holes = [0.0, 0.5, 0.3, 0.0, 0.2, 0.0, 0.0]. The Wasserstein distance is 2.20 computed like this:

(step)  from   to     flow     dist    work
  1.     [0]   [1]     0.2      1      0.20
  2.     [1]   [1]     0.1      0      0.00
  3.     [4]   [1]     0.2      3      0.60
  4.     [4]   [2]     0.1      2      0.20
  5.     [6]   [2]     0.2      4      0.80
  6.     [6]   [4]     0.2      2      0.40
                      -----           ------
                       1.0             2.20

Wasserstein distance = total work / total flow
                     = 2.20 / 1.0
                     = 2.20

In words:

1. all 0.2 in dirt[0] is moved to holes[1], using up dirt[0], with holes[1] needing 0.3 more.
2. all 0.1 in dirt[1] is moved to holes[1], using up dirt[1], with holes[1] needing 0.2 more.
3. just 0.2 in dirt[4] is moved to holes[1], filling holes[1], leaving 0.1 left in dirt[4].
4. all remaining 0.1 in dirt[4] is moved to holes[2], using up dirt[4], with holes[2] needing 0.2 more.
5. 0.2 in dirt[6] is moved to holes[2], filling holes[2], leaving 0.2 left in dirt[6].
6. all remaining 0.2 in dirt[6] is moved to holes[4], using up dirt[6], filling holes[4].
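There's a well-known shortcut that gives the same answer for 1D data on evenly spaced bins: the 1-Wasserstein distance equals the sum of the absolute differences between the two cumulative distributions. This isn't the move-the-dirt process described above, but it's a handy independent check on the 2.20 result:

```python
import numpy as np

dirt  = np.array([0.2, 0.1, 0.0, 0.0, 0.3, 0.0, 0.4])
holes = np.array([0.0, 0.5, 0.3, 0.0, 0.2, 0.0, 0.0])

# W1 = sum of |CDF(dirt) - CDF(holes)| when bins are unit-spaced
wd = np.sum(np.abs(np.cumsum(dirt) - np.cumsum(holes)))
print("%0.2f" % wd)  # 2.20
```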

My implementation from scratch mimics this algorithmic process. The implementation is moderately tricky and is best understood by examining my code below.

I validated my Wasserstein implementation from scratch by generating 100,000 random pairs of distributions and then computing the Wasserstein distance using both my implementation and the scipy library function. All computation results were the same.



Mermaids are a very common art subject. I speculate that this may be so because the morphological distance between a mermaid and a woman is just enough to make mermaids interesting but not creepy. I’m not a big fan of mermaid art but these three are pretty nice. The one on the right was painted by Erté (Romain de Tirtoff, 1892-1990), the famous artist and designer who developed the Art Deco style.


Code below. Long.

# wasserstein_scratch.py
# Wasserstein distance from scratch
# Not really true Wasserstein -- special version for
# p = 1, discrete probability distribution data.

import numpy as np
from scipy.stats import wasserstein_distance  # validation

def first_nonzero(dirt_or_holes):
  # index of first cell greater than 0.00
  dim = len(dirt_or_holes)
  for i in range(dim):
    if dirt_or_holes[i] "gt" 0.0:  # replace "gt"
      return i
  return -1  # no cells found

def move_dirt(dirt, from_idx, holes, to_idx):
  # move as much dirt at [from] as possible to holes[to]
  if dirt[from_idx] "lte" holes[to_idx]:  # use all dirt
    flow = dirt[from_idx]
    # print("moving all " + str(flow) + " from " + \
    # str(from_idx) + " to " + str(to_idx))
    dirt[from_idx] = 0.0  # all dirt got moved
    holes[to_idx] -= flow  # less to fill now
  elif dirt[from_idx] "gt" holes[to_idx]:  # use just part of dirt
    flow = holes[to_idx]  # fill remainder of hole
    # print("moving just " + str(flow) + " from " + \
    # str(from_idx) + " to " + str(to_idx))
    dirt[from_idx] -= flow
    holes[to_idx] = 0.0  # hole is filled
  dist = np.abs(from_idx - to_idx)
  return flow, dist, dirt, holes

def my_wasserstein(dirt, holes):
  # dirt and holes are frequency dists sum 1.0
  # loop until holes filled
  #  find available first dirt
  #  find first available hole
  #  move as much dirt as possible
  # end-loop

  dirt_c = np.copy(dirt) 
  holes_c = np.copy(holes)
  tot_work = 0.0

  while True:  # todo: add sanity counter check
    from_idx = first_nonzero(dirt_c)
    to_idx = first_nonzero(holes_c)
    if from_idx == -1 or to_idx == -1:
      break
    (flow, dist, dirt_c, holes_c) = \
      move_dirt(dirt_c, from_idx, holes_c, to_idx)
    tot_work += flow * dist
    # print(dirt_c); print(holes_c); input()
  return tot_work

def main():
  print("\nBegin Wasserstein from scratch demo ")
  np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
  np.random.seed(1)

  dirt  = np.array([0.2, 0.1, 0.0, 0.0, 0.3, 0.0, 0.4])
  holes = np.array([0.0, 0.5, 0.3, 0.0, 0.2, 0.0, 0.0])

  print("\nDirt distribution: ")
  print(dirt)
  print("\nHoles distribution: ")
  print(holes)

  wd = my_wasserstein(dirt, holes)
  print("\nWasserstein distance from scratch = %0.4f " % wd)

  bins = [0,1,2,3,4,5,6]
  wd = wasserstein_distance(bins, bins, dirt, holes)  # scipy version
  print("Wasserstein distance from scipy   = %0.4f " % wd)

  # validate scratch version against scipy
  # bins = [0,1,2,3,4,5,6,7,8,9]
  # for i in range(100_000):
  #   dirt = np.random.uniform(low=0.0, high=1.0, size=10)
  #   dirt = dirt / np.sum(dirt)
  #   holes = np.random.uniform(low=0.0, high=1.0, size=10)
  #   holes = holes / np.sum(holes)
  #   wd_scratch = my_wasserstein(dirt, holes)
  #   wd_scipy = wasserstein_distance(bins, bins, dirt, holes)
  #   if np.abs(wd_scratch - wd_scipy) "gt" 0.0001:
  #     print("Something wrong - different results! ")
  #     print("Wasserstein from scratch = %0.4f " % wd_scratch)
  #     print("Wasserstein from scipy   = %0.4f " % wd_scipy)
  #     input()

  print("\nEnd demo ")

if __name__ == "__main__":
  main()