The Kullback–Leibler (KL) divergence is a single number that measures how different one probability distribution is from another. KL(P, Q) is the divergence "to a distribution P from a distribution Q." It is somewhat like a distance, but not quite.
In the image below, on the left, distribution P holds observed frequencies for X = 0 to X = 6, and distribution Q is the mathematical Binomial distribution with p = 0.60. The graph shows that the two distributions are not very similar. The KL divergence "to P from Q" is KL(P, Q) = sum[ P * ln(P/Q) ], which is the same as -1 * sum[ P * ln(Q/P) ], and the calculation shows that it equals 0.4041.
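The calculation is easy to sketch in code. The actual observed frequencies behind the 0.4041 figure aren't listed in the post, so the P values below are made up for illustration; Q is the Binomial distribution with n = 6 trials and p = 0.60, matching the support X = 0 to X = 6.

```python
import math

# Hypothetical observed distribution P over X = 0..6 (the post's actual
# frequencies are not given, so these values are illustrative only).
P = [0.05, 0.10, 0.15, 0.25, 0.25, 0.15, 0.05]

# Binomial(n=6, p=0.60) probabilities Q over the same support X = 0..6.
n, p = 6, 0.60
Q = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def kl_divergence(P, Q):
    """KL(P, Q) = sum over x of P(x) * ln(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(P, Q))

print(kl_divergence(P, Q))
```

Note that this simple version assumes every P(x) and Q(x) is strictly positive; a zero in Q where P is nonzero makes the divergence infinite.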
Note that KL divergence is not a true distance metric because KL(P, Q) does not equal KL(Q, P).
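The asymmetry is easy to demonstrate with two small made-up distributions over three outcomes: swapping the roles of P and Q gives two different numbers.

```python
import math

def kl(P, Q):
    # KL(P, Q) = sum over x of P(x) * ln(P(x) / Q(x))
    return sum(p * math.log(p / q) for p, q in zip(P, Q))

# Two arbitrary distributions over three outcomes (illustrative values).
P = [0.10, 0.40, 0.50]
Q = [0.80, 0.15, 0.05]

print(kl(P, Q))  # divergence to P from Q
print(kl(Q, P))  # divergence to Q from P -- a different value
```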
In the image, on the right, P is the same observed probability distribution, but Q is the mathematical Uniform distribution. The graph shows that the Uniform distribution is much closer to the observed distribution than the Binomial distribution is, and the KL divergence is much smaller: 0.0437.
So, what’s the point? KL divergence can be used in several ways. In particular, it is a key ingredient of an interesting unsupervised machine learning technique called the Variational Autoencoder (which I’ll explain in a future post).