Computing PyTorch Negative Log Loss aka Cross Entropy Error

The PyTorch library has a built-in CrossEntropyLoss() function which can be used during training. Before I go any further, let me emphasize that “cross entropy error” and “negative log loss” are just two different names for the same technique for comparing a set of computed probabilities with a set of expected target probabilities.
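
To make the claim concrete, here’s a minimal sketch (the tensor values are arbitrary demo data) showing that the built-in CrossEntropyLoss(), which expects raw logits, gives the same result as NLLLoss() applied to LogSoftmax() output:

import torch as T

logits = T.tensor([[2.5, -1.5, 3.0],
                   [4.0,  1.0, 2.0]], dtype=T.float32)  # 2 items, 3 classes
targets = T.tensor([2, 0], dtype=T.int64)

cee = T.nn.CrossEntropyLoss()           # works on raw logits
nll = T.nn.NLLLoss()                    # works on log-probabilities
log_smax = T.nn.LogSoftmax(dim=1)

print(cee(logits, targets))             # about 0.3254
print(nll(log_smax(logits), targets))   # same value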

In some scenarios, it’s useful to compute cross entropy error without training. Two examples: computing overall error to compare two models that were trained differently, and computing error on a held-out validation dataset during training to see if over-fitting is starting to occur.

There are two general approaches for computing cross entropy error. You can compute error by iterating one data item at a time, or you can compute error en masse, all at once. The iteration approach allows you to print error one item at a time to see exactly what’s happening, but the approach is slower. The en masse approach is faster but doesn’t let you easily see what’s going on.

Suppose you have just four data items, and each item generates a vector of three values. This corresponds to a multi-class classification problem where there are three possible values to predict. And now suppose the four raw computed output values (“logits”) and their associated target labels are:

       logits          target
 2.50  -1.50   3.00      2
 4.00   1.00   2.00      0
-2.00   1.50   3.50      1
 5.00   2.00  -3.00      0

To compute cross entropy error, first you compute the softmax() of the logits to convert them into pseudo-probabilities, and then take the ln() of the pseudo-probabilities. You can do this in two steps using the softmax() and ln() functions, but it’s more efficient to use the built-in PyTorch LogSoftmax() class and do it in one step. Then you compute the negative sum of the log_softmax values that correspond to the targets, and divide by the number of data items to get an average cross entropy error.

For the data above, the log_softmax values are (with the target values in parentheses):

         log_softmax           target
 -0.9810   -4.9810  (-0.4810)    2
(-0.1698)  -3.1698   -2.1698     0
 -5.6305  (-2.1305)  -0.1305     1
(-0.0489)  -3.0489   -8.0489     0

Therefore, the average cross entropy error / average negative log loss is:

-1 * (-0.4810 + -0.1698 + -2.1305 + -0.0489) / 4 = 0.7076.
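
Here’s a quick sanity check of the hand calculation. It applies LogSoftmax() directly to the four logit vectors above (no model needed) and averages the negated values at the target indices:

import torch as T

logits = T.tensor([[ 2.50, -1.50,  3.00],
                   [ 4.00,  1.00,  2.00],
                   [-2.00,  1.50,  3.50],
                   [ 5.00,  2.00, -3.00]], dtype=T.float32)
targets = T.tensor([2, 0, 1, 0], dtype=T.int64)

log_softs = T.nn.LogSoftmax(dim=1)(logits)    # shape [4,3]
picked = log_softs[T.arange(4), targets]      # log-probs at the target indices
print(-T.sum(picked).item() / 4)              # 0.7076 (approximately)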

Here’s an implementation of the iterative approach, where data items are stored in a PyTorch Dataset object, which is now the de facto standard approach for storing and serving up data:

import torch as T 

def my_cee(model, dataset):
  # assumes model.eval()
  LS = T.nn.LogSoftmax(dim=0)   # class -- weird
  sum_lp = 0.0  # sum of target-class log-probs over all items
  n = len(dataset)
  for i in range(n):
    X = dataset[i]['predictors']  # inputs
    Y = dataset[i]['targets']     # target class: 0, 1, or 2
    with T.no_grad():
      logits = model(X)           # tensor shape [3]

    log_soft_out = LS(logits)
    sum_lp += log_soft_out[Y]     # log-prob of the target class

  return -sum_lp.item() / n       # average negative log loss

This is a short function, but like virtually everything in neural networks and PyTorch, it is very complex because each statement operates at a relatively high level. For example, the LogSoftmax is actually a complicated class with an invisible __call__() method, and its dim parameter is very tricky.
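
Here’s a rough illustration of why the dim parameter is tricky, using made-up values. On a 2-D tensor of logits, dim=1 normalizes each row (each data item), while dim=0 normalizes each column, which is silently wrong for the row-per-item layout used here:

import torch as T

t = T.tensor([[2.5, -1.5, 3.0],
              [4.0,  1.0, 2.0]], dtype=T.float32)  # two items, three classes

rows = T.nn.LogSoftmax(dim=1)(t)   # normalizes across each row
cols = T.nn.LogSoftmax(dim=0)(t)   # normalizes down each column

print(T.exp(rows).sum(dim=1))      # each of the two row sums is (approximately) 1.0
print(T.exp(cols).sum(dim=0))      # each of the three column sums is (approximately) 1.0

That’s why the iterative function above uses dim=0 (its logits tensor is 1-D, so there’s only one axis to normalize over), while the all-at-once version below uses dim=1 (its logits tensor is 2-D, one row per item).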

Here’s an implementation of the faster, but even more obscure, all-at-once approach:

def my_cee_quick(model, dataset):
  # assumes model.eval()
  LS = T.nn.LogSoftmax(dim=1)   # class -- weird
  n = len(dataset)
  X = dataset[0:n]['predictors']    # all inputs
  Y = dataset[0:n]['targets']       # all targets
  with T.no_grad():
    logits = model(X)               # all predicteds
  log_softs = LS(logits)            # all log_softmaxs
  Y = Y.reshape(n,1)                  # shape [n,1], as gather() requires
  log_probs = log_softs.gather(1, Y)  # extract log-prob at each target index
  sum_of_logs = T.sum(log_probs)      # sum extracted
  return -sum_of_logs.item() / n      # avg of neg sum
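
Neither function will run without a model and a Dataset that follow the conventions assumed above: items are dictionaries with 'predictors' and 'targets' keys, and the Dataset supports slice indexing. Here’s a minimal hypothetical setup (the DemoDataset and DemoNet names and the data values are just illustrations) that exercises both versions:

import torch as T

class DemoDataset(T.utils.data.Dataset):
  # hypothetical Dataset: dictionary-style items with 'predictors'
  # and 'targets' keys, and slice support for the all-at-once version
  def __init__(self, x_data, y_data):
    self.x_data = T.tensor(x_data, dtype=T.float32)
    self.y_data = T.tensor(y_data, dtype=T.int64)
  def __len__(self):
    return len(self.x_data)
  def __getitem__(self, idx):   # idx may be an int or a slice
    return {'predictors': self.x_data[idx], 'targets': self.y_data[idx]}

class DemoNet(T.nn.Module):
  # hypothetical 4-(8)-3 classifier; any model that maps an input
  # vector to three raw logits would work here
  def __init__(self):
    super().__init__()
    self.hid = T.nn.Linear(4, 8)
    self.oupt = T.nn.Linear(8, 3)
  def forward(self, x):
    z = T.tanh(self.hid(x))
    return self.oupt(z)   # raw logits -- no softmax

ds = DemoDataset([[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.7, 0.8],
                  [0.9, 1.0, 1.1, 1.2]], [2, 0, 1])
net = DemoNet()
net.eval()

print(my_cee(net, ds))        # iterative version
print(my_cee_quick(net, ds))  # all-at-once version -- same value

Both calls should print the same value; the exact number depends on the randomly initialized weights.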

When I’m developing a neural network model with PyTorch, I usually start with the iterative version of the custom cross entropy error / negative log loss function so that I can more easily debug the inevitable errors. When the model is fully functional, I’ll switch to the faster all-at-once version of the function.

I don’t think there’s a moral to this story. I write code almost every day. Writing code is a skill that requires continuous practice, and computer science requires continuous learning. But I greatly enjoy the mental exercise and so I write code mostly because I want to, rather than because I have to.


The computer science English words “iterative” and “iteration” are derived from the Latin word “iter” which means “path” or “route”. Two ancient Rome street scenes by artist Ettore Forti (1850-1940). If I had a time machine, one of the place-times I’d like to visit is ancient Rome. But I doubt that a time machine will be ready anytime soon.
