Avoiding an Exception when Calculating Softmax

The softmax of a set of values returns a set of values that sum to 1.0 so they can be interpreted as probabilities. The softmax function is one of the fundamental tools of machine learning. Suppose you have a neural network classifier to predict whether a person is a Democrat, a Republican, or other. The raw output values of the network could be something like (3.0, 5.0, 2.0), but the result of softmax would be (0.1142, 0.8438, 0.0420), which means P(Republican) = 0.8438, so Republican is the prediction because it has the highest probability.

Mathematically, to compute the softmax of a set of values, you compute exp(x) for each value x and sum those results. The exp(x) function is Euler's number, e = 2.71828..., raised to x (not to be confused with Euler's constant, 0.5772...). Then the softmax of each value x is exp(x) / sum.
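Here's a minimal Python sketch of the naive calculation (the function name is my own, not from any library):

import math

def softmax_naive(values):
    # exponentiate each value, then divide each result by the sum
    exps = [math.exp(x) for x in values]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_naive([3.0, 5.0, 2.0]))  # approximately [0.1142, 0.8438, 0.0420]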

The calculation can blow up because exp(x) returns astronomically large values for even moderate-sized values of x; for example, exp(710) overflows a standard double. One way to avoid an exception is to use the "max trick". Because exp(x - max) = exp(x) / exp(max), the constant factor exp(max) cancels out in the division, so the result is unchanged: you find the max of the x values, subtract the max from each x, then compute and sum the exp(x - max) values. The softmax of each x is, as before, exp(x - max) / sum.

For example, for (3.0, 5.0, 2.0), the max is 5.0 and subtracting gives (-2.0, 0.0, -3.0). The exp() of each is (0.1353, 1.0000, 0.0498), and the sum of those values is 1.1851, so 0.1353 / 1.1851 = 0.1142, 1.0000 / 1.1851 = 0.8438, and 0.0498 / 1.1851 = 0.0420. Notice that every exp() calculation now occurs on a value that is zero or negative, so each result lies between 0 and 1 and can't overflow.
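In Python, the max trick is just two extra lines (again a sketch, with my own function name):

import math

def softmax_max_trick(values):
    # subtract the max so every argument to exp() is zero or negative
    mx = max(values)
    exps = [math.exp(x - mx) for x in values]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_max_trick([3.0, 5.0, 2.0]))        # approximately [0.1142, 0.8438, 0.0420]
print(softmax_max_trick([900.0, 905.0, 902.0]))  # works fine

Notice the second call: the naive version would have to compute exp(900.0), which raises an OverflowError, but after subtracting the max the largest argument to exp() is 0.0.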

A variation of the max trick avoids the division operation. You compute the ln() of the sum of the exp(x - max) values, and then, instead of dividing, you subtract inside the exp(): the softmax of each x is exp(x - max - ln(sum)).

For example, the sum of the exp(x - max) values is 1.1851 and ln(1.1851) = 0.1698. Then exp(-2.0 - 0.1698) = 0.1142, exp(0.0 - 0.1698) = 0.8438, and exp(-3.0 - 0.1698) = 0.0420.
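Here's a sketch of the no-division variation (the function name is hypothetical):

import math

def softmax_log_sum(values):
    # compute exp(x - max - ln(sum)) instead of exp(x - max) / sum
    mx = max(values)
    ln_sum = math.log(sum(math.exp(x - mx) for x in values))
    return [math.exp(x - mx - ln_sum) for x in values]

print(softmax_log_sum([3.0, 5.0, 2.0]))  # approximately [0.1142, 0.8438, 0.0420]

This form trades the divisions for a single ln() call, and the intermediate quantity max + ln(sum) is exactly the log-sum-exp of the original values.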

A few years ago, when neural networks had to be implemented from scratch, you'd have to know details like this. But in the last two years or so, with the creation of neural network libraries such as TensorFlow and CNTK, all these details are handled for you. But it's still good to know what goes on behind the scenes.



Some examples of newspaper headlines that should have thrown an exception.
