In a PyTorch multi-class classification problem, the basic architecture is to apply log_softmax() activation on the output nodes, in conjunction with NLLLoss() during training. It's possible to compute softmax() and then apply log(), but computing log_softmax() directly is slightly more efficient and more numerically stable.
Computing softmax() looks like:
import torch as T

def softmax(x):
  mx = T.max(x)
  y = T.exp(x - mx)
  return y / T.sum(y)
Subtracting the max value is just a math trick to avoid arithmetic overflow when computing exp() on large values.
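A quick sanity check of the max trick, using made-up logit values large enough that a naive exp() overflows to infinity:

```python
import torch as T

def softmax(x):
  mx = T.max(x)
  y = T.exp(x - mx)
  return y / T.sum(y)

x = T.tensor([1000.0, 1001.0, 1002.0])
print(T.exp(x))    # naive exp() overflows: tensor([inf, inf, inf])
print(softmax(x))  # max trick is stable: tensor([0.0900, 0.2447, 0.6652])
```

Subtracting the max shifts every exponent into a safe range without changing the result, because the shift cancels in the numerator and denominator.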
Computing log_softmax() directly looks like:
def log_softmax(x):
  mx = T.max(x)
  lse = T.log(T.sum(T.exp(x - mx)))
  return x - mx - lse
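As a sketch, using arbitrary example values, the hand-rolled version can be checked against PyTorch's built-in T.log_softmax():

```python
import torch as T

def log_softmax(x):
  mx = T.max(x)
  lse = T.log(T.sum(T.exp(x - mx)))
  return x - mx - lse

x = T.tensor([2.0, 1.0, 0.1])
print(log_softmax(x))           # tensor([-0.4170, -1.4170, -2.3170])
print(T.log_softmax(x, dim=0))  # built-in gives the same values
```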
The reason why log_softmax() is applied to the output nodes is rather subtle. If the target class is at index [i], then the negative log likelihood loss is just the negative of the log_softmax() value at [i]. For example, if the log_softmax of a neural output is [-1.6563, -1.7563, -1.5563], and the target class label is 2, then the NLLLoss() is -(-1.5563) = 1.5563. Quite remarkable. One way to think of the log_softmax() plus NLLLoss() pairing is that log_softmax() actually computes the error and NLLLoss() just extracts the error.
If you just have a single set of log-softmax outputs and a single target class label, you could write an NLLLoss() like so:
def my_nll_loss(oupt, target):
  # oupt is a vector of log-softmax values
  result = -oupt[target]
  return result
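A quick usage check, using the example log-softmax values from above with target class 2, and comparing against the library NLLLoss() (which expects a batch dimension):

```python
import torch as T

def my_nll_loss(oupt, target):
  # oupt is a vector of log-softmax values
  result = -oupt[target]
  return result

oupt = T.tensor([-1.6563, -1.7563, -1.5563])
loss = my_nll_loss(oupt, 2)
print(loss)  # tensor(1.5563)

# library version wants shape (batch, classes) and a tensor of targets
lib = T.nn.NLLLoss()(oupt.unsqueeze(0), T.tensor([2]))
print(lib)   # tensor(1.5563)
```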
If you have a batch of output values and a vector of targets, you can use the clever diag() function like so:
def my_nll_loss(oupt, targets):
  # oupt is a matrix of log-softmax values
  out = T.diag(oupt[:,targets])  # one val from each row
  return -T.mean(out)
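A sketch with a made-up batch of two items, checking the diag() version against T.nn.NLLLoss(). The indexing oupt[:,targets] builds a square matrix whose diagonal holds each row's value at its own target column:

```python
import torch as T

def my_nll_loss(oupt, targets):
  # oupt is a matrix of log-softmax values
  out = T.diag(oupt[:,targets])  # one val from each row
  return -T.mean(out)

logits = T.tensor([[1.0, 2.0, 3.0],
                   [3.0, 1.0, 0.5]])
oupt = T.log_softmax(logits, dim=1)  # batch of log-softmax rows
targets = T.tensor([2, 0])

print(my_nll_loss(oupt, targets))
print(T.nn.NLLLoss()(oupt, targets))  # same value
```

Note that diag() materializes a batch-by-batch matrix, so this is fine for a demo but wasteful for very large batches.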
In the early days of neural networks, you'd compute softmax() on the output nodes and then apply an explicit cross entropy loss function. The softmax() plus cross entropy approach and the log_softmax() plus NLLLoss() approach give the same results, but the log_softmax() plus NLLLoss() approach is more efficient from an engineering perspective.
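In fact, PyTorch's CrossEntropyLoss() is documented as log_softmax() followed by NLLLoss(), so feeding raw logits to CrossEntropyLoss() gives the same result as the two-step pipeline. A sketch with arbitrary values:

```python
import torch as T

logits = T.tensor([[2.5, 0.5, 1.0],
                   [0.2, 3.1, 0.7]])  # raw output node values
targets = T.tensor([0, 1])

# pipeline 1: log_softmax() on outputs, then NLLLoss()
loss1 = T.nn.NLLLoss()(T.log_softmax(logits, dim=1), targets)

# pipeline 2: raw logits straight into CrossEntropyLoss()
loss2 = T.nn.CrossEntropyLoss()(logits, targets)

print(loss1, loss2)  # identical values
```

This is why PyTorch networks that use CrossEntropyLoss() have no activation on the output layer, while networks that use NLLLoss() apply log_softmax() themselves.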
In the old science fiction movies I enjoy, efficiency was sometimes achieved by reusing special effects snippets. A cool spaceship appeared in four different movies. Left: “Flight to Mars” (1951) was quickly produced in just a few weeks to take advantage of the publicity surrounding the Academy Award winning “Destination Moon” (1950). The spaceship for “Flight to Mars” was reused three times. Center-Left: “World Without End” (1956) is an OK film. Center-Right: “It! The Terror from Beyond Space” (1958) is a landmark film and the direct inspiration for “Alien” (1979). Right: “Queen of Outer Space” (1958) is better than you might guess based on the title.