## The Beta Distribution in Machine Learning

The beta distribution appears in several machine learning topics. Like many math distributions, the beta distribution is both simple (to use) and complex (to fully understand).

The beta distribution is best explained by starting with an example. I’ll use Python because, annoyingly, C# has no built-in function for sampling from a beta distribution.

```python
import numpy as np  # beta() is in np.random
np.random.seed(1)   # make reproducible
p1 = np.random.beta(a=1, b=1)  # probability1
p2 = np.random.beta(a=1, b=1)
p3 = np.random.beta(a=1, b=1)
```

This code will return three random probability values. Each will be between 0.0 and 1.0 and be uniformly distributed with an average of 0.5.

The a and b parameters, often called alpha and beta in math literature (which is absolutely terrible because now “beta” has two meanings: the distribution and the parameter), define how the distribution works. It’s similar to the way the Normal (Gaussian, bell-shaped) distribution has two parameters, mean and standard deviation, that define what kind of values you get.

For the beta distribution, you always get a probability value between 0.0 and 1.0 where the average probability returned is a / (a + b). When a = 1 and b = 1, the average return value is 1 / (1 + 1) = 0.5, and Beta(1, 1) is exactly the uniform distribution on the range 0.0 to 1.0.
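To make the a / (a + b) rule concrete, here is a minimal sketch that compares the formula against the average of many samples. The helper name `beta_mean`, the seed, the sample size, and the (2, 5) pair are my own choices for illustration.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed, only for reproducibility

def beta_mean(a, b):
    """Theoretical mean of Beta(a, b). Hypothetical helper name."""
    return a / (a + b)

# Compare the formula with the empirical average for a few (a, b) pairs.
results = {}
for a, b in [(1, 1), (3, 1), (2, 5)]:
    samples = np.random.beta(a, b, size=100_000)
    results[(a, b)] = samples.mean()
    print(f"a={a} b={b}  formula={beta_mean(a, b):.4f}  "
          f"sample mean={samples.mean():.4f}")
```

The two columns should agree to roughly two or three decimal places. The small gap is ordinary sampling noise.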

Suppose a = 3 and b = 1. The average returned probability value will be 3 / (3 + 1) = 0.75. The distribution is skewed toward 1.0, so a bit more than half of the returned values (about 58%) will be greater than 0.75, even though there’s a chance to get any value between 0.0 and 1.0. Here’s a graph of pulling 10,000 samples from Beta(a=3, b=1).
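If you want to reproduce something like this yourself, the sketch below pulls 10,000 samples from Beta(a=3, b=1) and prints a rough text histogram. The seed, the 10 bins, and the scaling of one `#` per 50 samples are arbitrary choices of mine, so this is not necessarily how the graph was made.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed, only for reproducibility
samples = np.random.beta(a=3, b=1, size=10_000)

print("sample mean:", round(samples.mean(), 4))
print("fraction above 0.75:", round((samples > 0.75).mean(), 4))

# Count samples in 10 equal-width bins covering [0.0, 1.0].
counts, edges = np.histogram(samples, bins=10, range=(0.0, 1.0))

# One '#' per 50 samples (assumed scale, just to fit on a line).
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:.1f}-{right:.1f} {'#' * (count // 50)}")
```

The bars should get longer as you move toward 1.0, which is the skew that pulls the average up to 0.75.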

So, that’s pretty easy. But why would the beta distribution ever be useful? This is much harder to explain and would take several pages so briefly . . .

Suppose you are observing some random process that emits a series of “success” or “failure” outcomes over time. You start without any knowledge, so you assume P(success) = P(failure) = 0.5. But then you observe: t = 1 success, t = 2 success, t = 3 failure, t = 4 success, t = 5 failure. What is coming next at t = 6?

Using beta, initially a = b = 1. You have 3 successes and 2 failures, so set a = 1 + 3 = 4 and b = 1 + 2 = 3. The estimated probability of success for t = 6 is a / (a + b) = 4 / (4 + 3) = 4/7 = 0.5714, and you could sample possible outcomes using beta.
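The success/failure example can be sketched in a few lines. The encoding of the observations as 1 and 0, the variable names, and the last step (draw a probability, then draw an outcome from it) are my assumptions about one reasonable way to “sample possible outcomes”.

```python
import numpy as np

# Observations for t = 1..5, encoded as 1 = success, 0 = failure.
observations = [1, 1, 0, 1, 0]

a, b = 1, 1  # starting point: no knowledge, so a = b = 1
successes = sum(observations)
failures = len(observations) - successes
a += successes  # 1 + 3
b += failures   # 1 + 2

p_success = a / (a + b)
print(a, b, round(p_success, 4))  # 4 3 0.5714

# Sample a few plausible success probabilities from Beta(4, 3).
np.random.seed(1)  # arbitrary seed, only for reproducibility
plausible = np.random.beta(a, b, size=5)
print(plausible)

# Simulate t = 6: success if a uniform draw falls below a sampled probability.
outcome = "success" if np.random.random() < plausible[0] else "failure"
print("simulated t = 6:", outcome)
```

Each new observation just adds 1 to a (success) or b (failure), so the estimate can be updated as the process keeps running.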

This should give you a hint of what the beta distribution is. For a complete explanation, the Wikipedia entry on the topic is very thorough.

Simulation of beta particle decay in Physics