Working with deep neural networks in PyTorch, or any other library, is difficult for several reasons. One reason is the huge number of low-level details. For example, when creating a multi-class classifier you have two common design options (there are many less-common options too). Option #1: Apply log_softmax() activation to the output nodes and use NLLLoss() ("negative log-likelihood loss") when training. Option #2: Apply no activation to the output nodes (or equivalently, identity() activation) and use CrossEntropyLoss() when training.
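To make the two options concrete, here is a minimal sketch of how each would appear in a network class definition (the layer sizes and tanh() hidden activation are arbitrary choices for illustration, not from a specific example):

import torch as T

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid = T.nn.Linear(6, 10)   # 6 inputs, 10 hidden (arbitrary)
    self.out = T.nn.Linear(10, 4)   # 4 output classes (arbitrary)

  def forward(self, x):
    z = T.tanh(self.hid(x))
    # Option #1: return T.log_softmax(self.out(z), dim=1)
    #            and train using T.nn.NLLLoss()
    # Option #2: return self.out(z) with no activation
    #            and train using T.nn.CrossEntropyLoss()
    return self.out(z)  # Option #2 shown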
I give fairly detailed examples of the two approaches at https://jamesmccaffrey.wordpress.com/2020/06/11/pytorch-crossentropyloss-vs-nllloss-cross-entropy-loss-vs-negative-log-likelihood-loss/.
# log_soft_demo.py
# Python 3.7.6 (Anaconda3-2020.02)
# PyTorch 1.6.0 Windows 10

import torch as T
device = T.device("cpu")

print("\nBegin softmax and log_softmax() demo \n")

t1 = T.tensor([1.0, 3.0, 2.0], dtype=T.float32).to(device)
sm = T.nn.functional.softmax(t1, dim=0)
lsm = T.nn.functional.log_softmax(t1, dim=0)
l_sm = T.log(T.nn.functional.softmax(t1, dim=0))

T.set_printoptions(precision=4)
print("tensor t1 = ", end=""); print(t1)
print("softmax(t1) = ", end=""); print(sm)
print("log_softmax(t1) = ", end=""); print(lsm)
print("log(softmax(t1)) = ", end=""); print(l_sm)

print("\nEnd demo ")
I computed softmax() and log_softmax() and log(softmax) of [1.0, 3.0, 2.0] using Excel, and then again using PyTorch.
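If you want to verify the arithmetic by hand: exp(1.0) = 2.7183, exp(3.0) = 20.0855, exp(2.0) = 7.3891, and their sum is 30.1929. So softmax([1.0, 3.0, 2.0]) = (0.0900, 0.6652, 0.2447), and taking the log of each of those values gives log_softmax([1.0, 3.0, 2.0]) = (-2.4076, -0.4076, -1.4076), which is what the demo program displays.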
Now on the one hand, this is all the information needed to implement a PyTorch multi-class classifier. But behind the scenes there are many details. These details can be confusing if your knowledge of neural networks is semi-theoretical. For example, what about softmax() activation on the output nodes? Briefly, in theory you want to apply softmax() to the raw output node values (called "logits") so that the sum of the output values is 1.0 and the values can be loosely interpreted as probabilities. Then you compare these pseudo-probabilities with the target output values. For example, a target output might be (0, 0, 1, 0) and the softmax-computed output might be (0.1, 0.2, 0.6, 0.1). The differences between computed outputs and target outputs are then used to adjust the network weights so that the computed output values get better.
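As a concrete check, using those same example values: because the target vector has a single 1 value, the cross entropy error reduces to just -log(0.6):

import torch as T

target = T.tensor([0.0, 0.0, 1.0, 0.0])
computed = T.tensor([0.1, 0.2, 0.6, 0.1])
ce = -T.sum(target * T.log(computed))  # = -log(0.6)
print(ce)  # tensor(0.5108)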
But PyTorch examples usually don't use this approach. It turns out that computing softmax() is astonishingly difficult if you want to avoid arithmetic underflow or overflow. (Believe me, I've tried.) So, for the sake of engineering, PyTorch uses log_softmax(), which significantly reduces the likelihood of arithmetic overflow (but unfortunately is still susceptible to underflow).
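To see the overflow problem concretely, here is a small sketch (my addition, not from the original post) that feeds large logit values to a naively computed softmax() and then to log_softmax():

import torch as T

big = T.tensor([1000.0, 1001.0, 1002.0])

# naive softmax: exp(1000.0) overflows float32 to inf, so the
# division produces inf / inf = nan
ex = T.exp(big)
print(ex / T.sum(ex))  # tensor([nan, nan, nan])

# log_softmax uses the log-sum-exp trick internally and is stable
print(T.nn.functional.log_softmax(big, dim=0))
# tensor([-2.4076, -1.4076, -0.4076])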
Somewhat unfortunately, the name of the PyTorch CrossEntropyLoss() function is misleading. In mathematics, a cross entropy loss function expects input values that sum to 1.0 (i.e., values that have been softmax()'ed), but the PyTorch CrossEntropyLoss() function expects raw logits, to which it applies log_softmax() internally. It is the NLLLoss() function that expects inputs that have already had log_softmax() applied.
Put another way: computing softmax is error-prone, and computing log_softmax is less error-prone. Therefore PyTorch usually uses log_softmax, but this means you need the special NLLLoss() function. Because of this confusion, PyTorch combines the two techniques into no activation plus CrossEntropyLoss(), which turns out to be even more confusing for beginners.
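A few lines of code can verify that the two design options compute exactly the same loss value (a minimal sketch; the logit values and class label are made up):

import torch as T

logits = T.tensor([[1.0, 3.0, 2.0]])  # raw output node values
target = T.tensor([1])                # correct class is index 1

# Option #2: raw logits fed to CrossEntropyLoss()
ce = T.nn.CrossEntropyLoss()(logits, target)

# Option #1: log_softmax() applied, then NLLLoss()
nll = T.nn.NLLLoss()(T.nn.functional.log_softmax(logits, dim=1), target)

print(ce)   # tensor(0.4076)
print(nll)  # tensor(0.4076)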
Details, details, details. But interesting, interesting, interesting.
An artificial neural network is a crude approximation of biological neurons. Both real neurons and artificial neurons have a lot of interesting detail. If you've ever looked closely at a bird feather, you'll have noticed the incredible number of tiny details it has. Left: Real feather earrings on actress Tia Carrere. Center: Real feather earrings on actress Patricia Velasquez. Right: Artificial feather earrings on actress Angelina Jolie. Both the real and the artificial feathers are very interesting to me because of their detail.