In neural networks, the log_softmax() function is used more often than the regular softmax() function for (at least) three reasons.
1.) computing log_softmax() is slightly less likely to fail due to arithmetic overflow or underflow than computing softmax().
2.) using log_softmax() is slightly more efficient than using softmax() when computing negative log likelihood loss (also called cross entropy error).
3.) using log_softmax() is slightly more efficient than using softmax() when computing network gradients.
1. log_softmax() is safer to compute
The softmax() function accepts a vector of values and returns a normalized vector whose values sum to 1.0. For example, if v = (2.0, 5.0, 3.0) then softmax(v) = (0.0420, 0.8438, 0.1142). The log_softmax() function is just the ln of the softmax() values. Because ln(0.0420) = -3.17, ln(0.8438) = -0.17, ln(0.1142) = -2.17, log_softmax(v) = (-3.17, -0.17, -2.17).
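The numbers above can be verified with a short sketch (I'm assuming NumPy here; the idea is the same in any library):

  import numpy as np

  v = np.array([2.0, 5.0, 3.0])
  sm = np.exp(v) / np.sum(np.exp(v))  # softmax: [0.0420 0.8438 0.1142]
  log_sm = np.log(sm)                 # log_softmax: [-3.1698 -0.1698 -2.1698]
  print(sm)
  print(log_sm)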
Computing softmax() directly can blow up if any of the input values are large. For three values, softmax is:
softmax(x1) = exp(x1) / [ exp(x1) + exp(x2) + exp(x3) ]
softmax(x2) = exp(x2) / [ exp(x1) + exp(x2) + exp(x3) ]
softmax(x3) = exp(x3) / [ exp(x1) + exp(x2) + exp(x3) ]
If any of the xi are even moderately large, exp(xi) gives arithmetic overflow (for example, exp() overflows float32 arithmetic when its argument is larger than about 88). Even if you escape this scenario, the sum of the exp(xi) in the denominator could be so huge that the division result suffers arithmetic underflow to 0, and a later ln(0) will fail.
There are two tricks that reduce the likelihood of failure. The first is the exp-max trick; the second is computing log_softmax() directly instead of computing softmax() and then computing the ln() of the softmax() result.
The result is:
log_softmax(x1) = (x1 - m) - ln[ exp(x1 - m) + exp(x2 - m) + exp(x3 - m) ], where m = max(x1, x2, x3).
Because m is the max of the xi values, (xi - m) will always be 0 or negative, so exp() will always work. Because ln(a / b) = ln(a) - ln(b), the division is replaced by a subtraction.
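A minimal sketch of the exp-max version (again assuming NumPy), with the naive computation shown for comparison:

  import numpy as np

  def log_softmax(x):
      m = np.max(x)                                    # exp-max trick: shift by the max
      return (x - m) - np.log(np.sum(np.exp(x - m)))   # exp() arguments are all 0 or negative

  big = np.array([1002.0, 1005.0, 1003.0])
  print(log_softmax(big))                              # [-3.1698 -0.1698 -2.1698] -- works

  naive = np.log(np.exp(big) / np.sum(np.exp(big)))    # exp(1002.0) overflows to inf
  print(naive)                                         # [nan nan nan] -- fails

The result matches the small example above because adding a constant to every xi doesn't change softmax() or log_softmax().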
2. log_softmax() is slightly more efficient when computing negative log likelihood
Suppose you have four items, each of which has three output (logit) values, and four target ordinal values, and you want to compute the negative log likelihood, aka cross entropy error:
                 target
  2.0  5.0  3.0    1
  1.0  2.0  7.0    2
  6.0  3.0  1.0    1
  4.0  3.0  3.0    0
The softmax of the four items is:
                          target
  0.0420  0.8438  0.1142    1
  0.0025  0.0067  0.9909    2
  0.9465  0.0471  0.0064    1
  0.5761  0.2119  0.2119    0
The logs of the likelihoods (the softmax values at the target indices), the sum of their negatives, and the average negative log likelihood (cross entropy error) are:
  ln(0.8438) = -0.1698
  ln(0.9909) = -0.0092
  ln(0.0471) = -3.0550
  ln(0.5761) = -0.5514

  sum = 3.7855
  avg = 0.9464
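In code, the indirect computation looks something like this (a sketch, assuming NumPy):

  import numpy as np

  logits = np.array([[2.0, 5.0, 3.0],
                     [1.0, 2.0, 7.0],
                     [6.0, 3.0, 1.0],
                     [4.0, 3.0, 3.0]])
  targets = np.array([1, 2, 1, 0])

  exps = np.exp(logits)
  sm = exps / exps.sum(axis=1, keepdims=True)        # the softmax table above
  nll = -np.mean(np.log(sm[np.arange(4), targets]))  # extra ln() step needed
  print(nll)                                         # 0.9464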
But suppose you compute log_softmax() directly instead of computing softmax() first. The log_softmax() values are:
                             target
  -3.1698  -0.1698  -2.1698    1
  -6.0092  -5.0092  -0.0092    2
  -0.0550  -3.0550  -5.0550    1
  -0.5514  -1.5514  -1.5514    0
The log likelihoods are all already there, so the negative log likelihood is -(-0.1698 + -0.0092 + -3.0550 + -0.5514) / 4 = 0.9464. Computing log_softmax() directly saves you the step of applying ln() to the softmax() values. A tiny efficiency.
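For comparison, the direct path. I'm assuming the PyTorch log_softmax() and nll_loss() functions here; the numbers don't depend on any particular library:

  import torch as T
  import torch.nn.functional as F

  logits = T.tensor([[2.0, 5.0, 3.0],
                     [1.0, 2.0, 7.0],
                     [6.0, 3.0, 1.0],
                     [4.0, 3.0, 3.0]])
  targets = T.tensor([1, 2, 1, 0])

  log_sm = F.log_softmax(logits, dim=1)  # the log_softmax table above
  loss = F.nll_loss(log_sm, targets)     # no separate ln() step
  print(loss.item())                     # 0.9464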
3. log_softmax() is slightly more efficient when computing gradients
Briefly, gradient techniques usually work better when optimizing ln(p(x)) than p(x) because the gradient of ln(p(x)) is usually better scaled.
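A small sketch of that idea (assuming PyTorch autograd): when log_softmax() output is fed to negative log likelihood loss, the gradient with respect to the logits works out to softmax(logits) minus the one-hot target, so each gradient component lies between -1 and +1.

  import torch as T
  import torch.nn.functional as F

  logits = T.tensor([[2.0, 5.0, 3.0]], requires_grad=True)
  target = T.tensor([1])

  loss = F.nll_loss(F.log_softmax(logits, dim=1), target)
  loss.backward()
  print(logits.grad)  # [[ 0.0420, -0.1562,  0.1142]] = softmax(logits) - one_hot(target)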
In computer science, efficiency is usually very important. I don’t think efficiency is directly related to art, but I like art that is efficient in terms of color and drawing strokes. Three efficient illustrations by Edmond Kiraz (1923-2020). Kiraz had an appealing (to me anyway) and distinctive style, which he called “Les Parisiennes”.