Example of Computing Kullback-Leibler Divergence for Continuous Distributions

In this post, I present an example of estimating the Kullback-Leibler (KL) divergence between two continuous distributions using the Monte Carlo technique. Whoa! Just stating the problem has a massive amount of information.

The KL divergence is the key part of the ELBO (evidence lower bound) loss function, which in turn is the key to the creation of Variational Autoencoder (VAE) neural systems that generate synthetic data.

You can tell right away that there are dozens of interrelated topics here — individually they are moderately complicated, but when you combine all of them, things get fantastically complicated. This is the general challenge in all machine learning techniques.

A good way to think about what a continuous distribution for the context of this post is focusing on one specific example: the Normal (aka Gaussian) distribution is characterized completely by a mean (mu, average value) and a standard deviation (sd, sigma, measure of spread). If you graph the Normal distribution you get a bell-shaped curve with X from -infinity to +infinity — any X value is possible (hence “continuous”) but if you pick an X randomly, it will almost always be between -4 * sigma and +4 * sigma. The curve is called the probability density function (PDF). The height of the PDF curve at a value x is not the probability of x but is a measure of how likely x is.

The Kullback-Leibler divergence is a number that describes how different two distributions (P and Q) are. If KL(P, Q) = 0, the two distributions are the same. The larger the value of KL(P, Q), the greater the difference between P and Q. Note that because of the way KL is defined mathematically, KL(P, Q) is not necessarily equal to KL(Q, P) — KL is not symmetric. KL can be defined for continuous distributions (the topic of this post) or for discrete distributions (a different topic).

The Monte Carlo technique loosely means “an estimate computed by simulation”. Suppose you wanted to estimate the probability of getting Heads if you flip a specific coin. You could flip the coin 100 times. If you got 53 Heads (and 47 Tails) your estimate of pr(Heads) = 53/100 = 0.53.

The demo has three parts. The first part sets up P and Q and illustrates key concepts. The second part computes a KL(P, Q) divergence using the Monte Carlo technique. The third part computes KL for the same P and Q using a magic math equation.

The first part sets up P as a Normal distribution with mean = 0.0 and standard deviation = 1.0 — this is the Standard Normal distribution. A second distribution Q with mean = 3.0 and standard deviation = 1.4 is created. Next a value z = 2.5 is set:

p_mu = T.tensor([0.0], dtype=T.float32)
p_sd = T.tensor([1.0], dtype=T.float32)

q_mu = T.tensor([3.0], dtype=T.float32)
q_sd = T.tensor([1.4], dtype=T.float32)

P = T.distributions.Normal(p_mu, p_sd)
Q = T.distributions.Normal(q_mu, q_sd)

print("\nSetting z = 2.5 ")
z = T.tensor([2.5], dtype=T.float32)

I use PyTorch tensors but the exact same ideas work with NumPy arrays too.

Next, the demo computes two PDF values:

log_p_df_z = P.log_prob(z)
p_df_z = T.exp(log_p_df_z).item()

log_q_df_z = Q.log_prob(z)
q_df_z = T.exp(log_q_df_z).item()

print("Semi-likelihood z comes from P = %0.4f " % p_df_z)
print("Semi-likelihood z comes from Q = %0.4f " % q_df_z)

The P.log_prob(z) function gives the log of the value of the P = N(0,1) distribution PDF at z = 2.5. The log() is used because 1.) raw PDF values are often extremely small, and 2.) log() values are much easier to work with. The minor downside is that log() values are negative. I use T.exp() to convert the values from log to raw. I write “semi-likelihood” to emphasize the values aren’t true probabilities.

The likelihood value that z = 2.5 came from P = N(0,1) is 0.0175 and the likelihood z came from Q = N(3, 1.4) is 0.2674. In other words, it’s more likely z came from Q than from P. This makes sense because z is much closer to the mean of Q than it is to the mean of P.

The second part of the demo uses the concepts to estimate KL(P,Q) using the Monte Carlo technique:

n_samples = 100

print("\nEstimating KL(P,Q) using %d samples " % n_samples)
sum = 0.0
for i in range(n_samples):
  z = Q.rsample()
  kl = Q.log_prob(z) - P.log_prob(z)  # magic math
  sum += kl.item()

KL = sum / n_samples
print("Estimated Monte Carlo KL = %0.4f " % KL)

Update: the simulation code actually computes reverse KL(P,Q) because that’s what is used in ELBO loss for variational autoencoders, which was the motivation for this post. See the comments below.

The code iterates 100 times. In each iteration a random value z is drawn from the Q distribution using the rsample() method. The PDF likelihood values that z comes from P and from Q are computed (in log form). The magic math is that the KL divergence is the difference between the log_pdf values. The 100 estimated KL(P, Q) values are summed and then the average is computed. The result is 4.3975 which indicates the two distributions aren’t very close.

The last part of the demo uses some fancy math to estimate KL(P,Q). It turns out that there’s a special case when P is standard Normal ( mean = 0 and sd = 1). In this special case, KL(P,Q) can be computed directly like so:

print("\nEstimating KL(P,Q) using special N(0,1) math ")
q_logvar = T.log(q_sd.pow(2))
KL = -0.5 * T.sum(1 + q_logvar - q_mu.pow(2) - q_logvar.exp())
print("Special case computed KL = %0.4f " % KL)

This is not at all obvious and is one of the many “magic math” equations in machine learning. Because of this very nice property, systems that need to compute KL for two distributions often assume one or both are N(0,1) even when they might not be — the assumption makes calculations much, much easier.

In machine learning, computing KL divergence between two distributions is most often seen in variational autoencoders, but the technique is used in several other problem scenarios too, such as Bayesian neural networks where each network weight is a Normal distribution instead of a single fixed value.

I’ve always thought that an effective travel poster is one that finds the right distance between reality and a happy fantasy ideal. Here are three old Hawaii travel posters like the ones that contributed to a huge increase in Hawaii tourism starting in the late 1950s. Left: By artist Bern Hill. Center: By artist Paul Lawler. Right: By artist Joseph Feher.

Complete demo code:

# kl_estimate_demo.py

# import numpy as np
import torch as T

print("\nBegin KL continuous Monte Carlo demo ")
T.manual_seed(0)

print("\nCreating P = (N, 0, 1) and Q = (N, 3, 1.4) ")
p_mu = T.tensor([0.0], dtype=T.float32)
p_sd = T.tensor([1.0], dtype=T.float32)

q_mu = T.tensor([3.0], dtype=T.float32)
q_sd = T.tensor([1.4], dtype=T.float32)

P = T.distributions.Normal(p_mu, p_sd)
Q = T.distributions.Normal(q_mu, q_sd)

print("\nSetting z = 2.5 ")
z = T.tensor([2.5], dtype=T.float32)

log_p_df_z = P.log_prob(z)
p_df_z = T.exp(log_p_df_z).item()

log_q_df_z = Q.log_prob(z)
q_df_z = T.exp(log_q_df_z).item()

print("Semi-likelihood z comes from P = %0.4f " % p_df_z) # .0175
print("Semi-likelihood z comes from Q = %0.4f " % q_df_z) # .2674

n_samples = 100

print("\nEstimating KL(P,Q) using %d samples " % n_samples)
sum = 0.0
for i in range(n_samples):
  z = Q.rsample()
  kl = Q.log_prob(z) - P.log_prob(z)  # magic math
  sum += kl.item()

KL = sum / n_samples
print("Estimated Monte Carlo KL = %0.4f " % KL)

print("\nEstimating KL(P,Q) using special N(0,1) math ")
# KLD = 0.5 * sum(1 + log(sigma^2) - u^2 - sigma^2)
q_logvar = T.log(q_sd.pow(2))
KL = -0.5 * T.sum(1 + q_logvar - q_mu.pow(2) - q_logvar.exp())
print("Special case computed KL = %0.4f " % KL)

print("\nEnd demo ")

5 Responses to Example of Computing Kullback-Leibler Divergence for Continuous Distributions

Ali G. (@__ali__g) says:

August 23, 2022 at 5:16 am

Nice intro. Thx.
However, there’s one major error. You’re actually computing KL(Q,P) because you sample z from Q and order the substraction as: Q.log_prob(z) – P.log_prob(z)
jamesdmccaffrey says:

August 23, 2022 at 11:06 am

I’m not sure about this. When I run the code as-is but with 100,000 iterations, I get KL = 4.6431 using the simulation and KL = 4.6435 using the special case equation. So the two techniques agree. It could be the case that the special case equation, which comes from several research papers on Variational Autoencoders, is computing KL(Q,P) instead of KL(P,Q).
jamesdmccaffrey says:

August 23, 2022 at 12:22 pm

After some more thought, the key variable in the Monte Carlo simulation is whether to sample from P or Q. I can’t remember what resource I used to determine that the simulation for KL(P,Q) should sample from Q rather than P. So you could well be correct that the demo code computes KL(Q,P) instead of KL(P,Q) as intended. I’ll try to investigate when I get some time.
jamesdmccaffrey says:

August 23, 2022 at 1:28 pm

After more, more thought — the motivation for this blog post was the use of KL divergence in the ELBO loss function used by Variational Auto Encoders. It turns out that VAE ELBO uses reverse KL divergence because it’s better suited for minimizing loss. So, yes, the simulation code does in fact compute KL(Q,P) instead of KL(P,Q).

That should have been totally obvious to me just by looking at the definition of KL(P,Q). So thank you Ali G for pointing this out.

See the post at: agustinus.kristia.de/techblog/2016/12/21/forward-reverse-kl/ for a nice explanation.
Ali G. (@__ali__g) says:

August 23, 2022 at 1:32 pm

You’re welcome sir. Thank you for the great tutorial.