Implementing Kullback-Leibler Divergence from Scratch Using Python

The Kullback-Leibler divergence is a single number that measures the difference between two probability distributions. I wrote some machine learning code for work recently and I used the scipy.stats.entropy function from the Python SciPy library. That library version of KL is very complex and can handle all kinds of scenarios. But my problem scenario was very basic, so I wondered how difficult it would be to implement KL from scratch using Python. It turned out to be very easy.
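
For reference, calling the library version looks something like this (scipy's entropy function computes the KL divergence when you give it two distributions, and it uses the natural log by default; the variable names here are just my choice):

import numpy as np
from scipy.stats import entropy

p = np.array([9.0/25.0, 12.0/25.0, 4.0/25.0])
q = np.array([1.0/3.0, 1.0/3.0, 1.0/3.0])

kl_pq = entropy(p, q)  # KL(P,Q) using natural log
print(kl_pq)           # approximately 0.0853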

I reviewed my knowledge of KL by going to the Wikipedia article on the topic. The Wikipedia article was excellent — for me anyway. Whether or not a Wikipedia article on a technical topic is good for a particular person depends entirely on that person’s background knowledge of the topic.


The Wikipedia article on Kullback-Leibler Divergence is excellent.

Anyway, the Wikipedia article gave a worked example of KL. Hooray! (Wikipedia authors — please always give a worked example of an algorithm or metric!) So I decided to replicate that example. The example set up a first distribution with three values: P = (9/25, 12/25, 4/25) and a second distribution: Q = (1/3, 1/3, 1/3). The value of KL(P,Q) = 0.085300 and the value of KL(Q,P) = 0.097455.
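
To make the arithmetic concrete, KL(P,Q) expands to a three-term sum using the natural log:

KL(P,Q) = (9/25) * ln((9/25) / (1/3))
        + (12/25) * ln((12/25) / (1/3))
        + (4/25) * ln((4/25) / (1/3))
        = 0.0277 + 0.1750 - 0.1174
        = 0.0853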


Implementing KL from scratch was very easy because I only needed a version for discrete distributions and I didn't need to do any error checking. For example, if any cell of the Q distribution is 0, you'd get a divide-by-zero error. Or if the values in either P or Q don't sum to 1, you'd get an incorrect result.
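
If you did want some minimal checking, a sketch might look like this (a hypothetical helper that is not part of the demo code below; the name check_distributions and the tolerance value are just my choices):

import numpy as np

def check_distributions(p, q, eps=1.0e-5):
  # raise if Q has a zero cell, or if P or Q
  # doesn't sum to 1 (within a small tolerance)
  if np.any(q <= 0.0):
    raise ValueError("Q has a zero (or negative) cell")
  if abs(np.sum(p) - 1.0) > eps or abs(np.sum(q) - 1.0) > eps:
    raise ValueError("P and Q must each sum to 1")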

Small values of KL indicate the two distributions are similar, and larger values indicate greater difference. If KL = 0, the two distributions are identical. KL is not symmetric, so KL(P,Q) != KL(Q,P) in general. To get rid of this minor annoyance, you can compute KL in both directions and then either sum the two values or take their average.
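
For example, a symmetric version based on the KL() function defined in the code below might look like this (a minimal sketch; the name sym_KL is just my choice):

def sym_KL(p, q):
  # average of KL computed in both directions -- symmetric
  return (KL(p, q) + KL(q, p)) / 2.0

For the Wikipedia example data this gives about (0.0853 + 0.0975) / 2 = 0.0914.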

I coded up KL from scratch and ran it with the Wikipedia example data and got the same results.

There’s a trade-off between using a function from a code library and implementing the function from scratch. I prefer implementing from scratch when feasible. Early in my career as a software developer, I worked on complex projects that used C++ with COM technology. We used many code libraries. Dependency problems, including DLL Hell, were an absolute nightmare. The more things you implement from scratch, the fewer dependencies you have. So if a function can be implemented from scratch quickly and reliably, I’ll use that strategy.



Sometimes doing things yourself is not a good strategy. Left: TV on-off switch repair. Center: New electrical outlet added — questionable location. Right: One way to prevent a circuit breaker from tripping.


# kl_scratch.py
# Kullback-Leibler from scratch

import numpy as np

def KL(p, q):
  # "from q to p"
  # p and q are np array frequency distributions

  n = len(p)
  total = 0.0
  for i in range(n):
    total += p[i] * np.log(p[i] / q[i])
  return total

def main():
  print("\nBegin Kullback-Leibler from scratch demo ")
  np.set_printoptions(precision=4, suppress=True)

  p = np.array([9.0/25.0, 12.0/25.0, 4.0/25.0], dtype=np.float32)
  q = np.array([1.0/3.0, 1.0/3.0, 1.0/3.0], dtype=np.float32)

  print("\nThe P distribution is: ")
  print(p)
  print("The Q distribution is: ")
  print(q)

  kl_pq = KL(p, q)
  kl_qp = KL(q, p)

  print("\nKL(P,Q) = %0.6f " % kl_pq)
  print("KL(Q,P) = %0.6f " % kl_qp)

  print("\nEnd demo ")

if __name__ == "__main__":
  main()