The softmax function appears in many machine learning algorithms, in particular neural networks and prediction markets. The idea is to take a set of values and scale them so that they are all positive and sum to 1.0, and can therefore be interpreted as probabilities.
For example, suppose you have three values, (x0, x1, x2) = (3.0, 5.0, 2.0). The softmax function for any value xj, expressed mathematically, is:

s(xj) = exp(xj) / (exp(x0) + exp(x1) + exp(x2))
In words, find the sum of e (“Euler’s number” — not to be confused with “Euler’s constant”) raised to each x value. The softmax for a particular x is e raised to x divided by the sum. So:
exp(3.0) = 20.0855
exp(5.0) = 148.4132
exp(2.0) = 7.3891
sum = 175.8878
And the softmax values are:
s(3.0) = 20.0855 / 175.8878 = 0.1142
s(5.0) = 148.4132 / 175.8878 = 0.8438
s(2.0) = 7.3891 / 175.8878 = 0.0420
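The direct calculation can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation. The function name softmax_naive is made up for this example, and the input is assumed to be a non-empty list of ordinary floats. The printed values should agree with the hand calculation up to rounding.

```python
import math

def softmax_naive(xs):
    """Direct softmax: exp(x) divided by the sum of all exp values.

    Hypothetical helper for illustration; assumes xs is a non-empty list.
    """
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_naive([3.0, 5.0, 2.0])
print([round(p, 4) for p in probs])  # softmax of each input, to four decimals
print(sum(probs))                    # very close to 1.0
```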
Notice the softmax values sum to 1.0. In practice, calculating softmax values can go wrong if any x value is very large. The exp() of even a moderately large positive number can exceed the largest value a floating-point type can hold, so the individual terms and their sum overflow to infinity, and the division then produces an error or a meaningless result instead of a probability.
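To make the failure concrete, here is a small check in Python, assuming standard 64-bit floating-point numbers (the largest finite value is about 1.8e308, so exp(x) overflows once x passes roughly 709.78). The variable names are just for this illustration.

```python
import math

# exp(709.0) is still representable as a 64-bit float.
print(math.exp(709.0))

# exp(710.0) is not. Python's math.exp raises OverflowError here
# instead of returning infinity.
try:
    math.exp(710.0)
    overflowed = False
except OverflowError:
    overflowed = True
print("exp(710.0) overflowed:", overflowed)

# Environments that return infinity instead of raising end up computing
# inf / inf, which is NaN rather than a probability.
ratio = math.inf / math.inf
print(math.isnan(ratio))
```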
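A trick to avoid this computation problem is to subtract the largest x value from each x value before applying exp(). Because exp(x - max) = exp(x) / exp(max), every term in the numerator and in the sum is divided by the same factor exp(max), which cancels in the ratio. You get the same result but avoid extremely large numbers.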
For (3.0, 5.0, 2.0), the largest value is 5.0. Subtracting 5.0 from each gives (-2.0, 0.0, -3.0), and so:
exp(-2.0) = 0.1353
exp(0.0) = 1.0000
exp(-3.0) = 0.0498
sum = 1.1851

s(3.0) = 0.1353 / 1.1851 = 0.1142
s(5.0) = 1.0000 / 1.1851 = 0.8438
s(2.0) = 0.0498 / 1.1851 = 0.0420
which are the same softmax values as when computed directly. Notice that all the (x - max) values will be negative, or 0.0 for the largest x, so you avoid e raised to large positive values. The max trick isn't entirely foolproof, however, because e raised to a very negative value can underflow to 0.0. This is usually not troublesome in practice: the term for the largest x is always exp(0.0) = 1.0, so the sum is at least 1.0 and the division is safe, and an underflowed term simply yields a softmax value of 0.0.
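The max trick can be sketched in Python as follows. This is an illustrative version under simple assumptions (a non-empty list of finite floats). The function name softmax_stable is made up for this example. The second and third calls use inputs chosen to show the overflow and underflow cases discussed above.

```python
import math

def softmax_stable(xs):
    """Softmax with the max trick: shift by the largest x before exp().

    Hypothetical helper for illustration; assumes xs is a non-empty list.
    """
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # every exponent is <= 0.0
    total = sum(exps)                     # at least 1.0, since exp(0.0) is a term
    return [e / total for e in exps]

small = softmax_stable([3.0, 5.0, 2.0])
print([round(p, 4) for p in small])

# Inputs this large would overflow a direct exp(), but the shifted values
# are (-2.0, 0.0, -3.0), the same as in the small example.
big = softmax_stable([1000.0, 1002.0, 999.0])
print([round(p, 4) for p in big])

# exp(-2000.0) underflows to 0.0; the result is still a valid distribution.
tiny = softmax_stable([0.0, -2000.0])
print(tiny)
```

Shifting by the maximum costs one extra pass over the values, which is usually a small price for avoiding overflow.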