Sampling from the Beta Distribution using Python

The beta distribution pops up from time to time in my work with machine learning. Many people are familiar with the Gaussian (also called normal, or bell-shaped) distribution. A particular Gaussian distribution is characterized by a mean and a standard deviation which determine the behavior of the distribution. Each set of (mean, sd) values determine a different Gaussian distribution. For example, if mean = 0.0 and sd = 1.0 then if you draw many sample values (usually called z) from the distribution, you’d expect about 68% of the z values to be between -1.0 and +1.0 and about 95% of the z values to be between -2.0 and +2.0, and so on.

The beta distribution also has two characteristic values, usually called alpha and beta, or more succinctly, just a and b. Each set of (a,b) pairs determine a different beta distribution. When you sample from beta(a,b) each sample value (I usually call them p values) will be between 0.0 and 1.0 and if you sample many values they will average to a / (a+b). For example, if you sample many values from beta(3, 1), each value will be between 0.0 and 1.0 and all the values will average to about 3/4 = 0.75.

The NumPy add-on package for the Python language has a built-in beta() function. For example:

import numpy as np
np.random.seed(0)
for i in range(1000):
  p = np.random.beta(3,1)
  print(p)

will generate 1,000 p-values between 0.0 and 1.0 that average to about 0.75. One of my character flaws is that I’m never completely happy using functions from a code library unless I completely understand the function. And that means I want to be able to implement the function from scratch.

After a bit of research, I found a 1978 research paper titled “Generating Beta Variates with Nonintegral Shape Parameters” by R. C. H. Cheng. The paper provided a basic (meaning somewhat inefficient for 1970s era computers) algorithm. So, I coded up the algorithm using raw Python. I generated 10,000 samples from beta(3,1) and compared the results to the beta() function in the NumPy library and got the same results. I was happy about that.