Interpreting the Result of a PyTorch Loss Function During Training

The bottom line: When you train a PyTorch neural network, you should always display a summary of the loss values so that you can tell whether training is working or not. The exact meaning of the summary loss values you display depends on how you compute them. In most cases summary loss values are not comparable across different batch sizes, so in practice you mostly just want to see whether the loss values are decreasing or not.

This is a bit tricky to explain. A snippet of PyTorch training code looks like this:

import torch as T  # the net, train_ldr, lrn_rate, max_epochs objects are assumed defined

loss_func = T.nn.CrossEntropyLoss()
optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)

for epoch in range(0, max_epochs):
  epoch_loss = 0  # accumulated for one full epoch

  for (batch_idx, batch) in enumerate(train_ldr):
    X = batch['predictors']  # inputs
    Y = batch['targets']     # correct class / targets

    optimizer.zero_grad()
    oupt = net(X)                  # computed outputs
    loss_val = loss_func(oupt, Y)  # average per item
    epoch_loss += loss_val.item()  # sum of averages
    loss_val.backward()
    optimizer.step()

  if epoch % 100 == 0:  # every 100 epochs
    print("epoch = %4d loss = %0.4f" % (epoch, epoch_loss))

print("Training complete")

You process one batch of training items at a time. The loss_func() call returns the average loss for the items in the batch, because CrossEntropyLoss() uses mean reduction by default. You usually don’t want to print the loss value for each batch, or even for each epoch, because that would be too much information. The code snippet above accumulates the batch loss values, so the value displayed is a sum of averages.
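To make the "average per item" behavior concrete, here is a minimal sketch. The logits and labels are made-up values just for illustration; the reduction parameter of CrossEntropyLoss() controls whether you get a per-batch mean (the default) or a per-batch sum:

loss_func_mean = T.nn.CrossEntropyLoss()                # default reduction='mean'
loss_func_sum = T.nn.CrossEntropyLoss(reduction='sum')  # per-batch sum instead

oupt = T.tensor([[2.0, 0.5], [0.3, 1.8]])  # raw output logits for a batch of 2 items
Y = T.tensor([0, 1])                       # correct class labels

print(loss_func_mean(oupt, Y).item())  # average loss per item in the batch
print(loss_func_sum(oupt, Y).item())   # exactly 2x the mean for this batch of 2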


Left: A training run with batch size = 10; the accumulated average loss values are about 20.0. Right: The same data but with batch size = 20; the accumulated average loss values are about 10.0.


Here’s a concrete example (below). Suppose you have just 8 training items and you use a batch size of 2. The per-item loss values might look like the top table, where the displayed accumulated loss value would be 12.00. But if you use a batch size of 4 and the loss values were exactly the same (in practice they wouldn’t be, because gradients are computed per batch), you would see an accumulated loss value of 6.00.

bat_sz = 2

item    loss        bat_loss
-----------------------------
0        4.0  9.0 / 2 = 4.50
1        5.0

2        2.0  6.0 / 2 = 3.00
3        4.0

4        3.0  5.0 / 2 = 2.50
5        2.0

6        3.0  4.0 / 2 = 2.00
7        1.0

               accum_bat_loss
       (24.0)          12.00

=============================

bat_sz = 4

item    loss         bat_loss
-----------------------------
0        4.0  15.0 / 4 = 3.75
1        5.0
2        2.0  
3        4.0

4        3.0   9.0 / 4 = 2.25
5        2.0
6        3.0   
7        1.0

                accum_bat_loss
       (24.0)            6.00

This example demonstrates that if you accumulate the return values of loss_func(), which is the usual approach, the displayed accumulated loss value depends on the batch size: smaller batch sizes generate larger accumulated loss values because there are more batches, and therefore more accumulations.
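You can verify the arithmetic of the example with a few lines of plain Python. The item_losses values are the made-up numbers from the tables above:

item_losses = [4.0, 5.0, 2.0, 4.0, 3.0, 2.0, 3.0, 1.0]  # sums to 24.0

for bat_sz in [2, 4]:
  accum_bat_loss = 0.0
  for i in range(0, len(item_losses), bat_sz):
    batch = item_losses[i : i+bat_sz]
    accum_bat_loss += sum(batch) / len(batch)  # average loss for one batch
  print(bat_sz, accum_bat_loss)  # bat_sz = 2 gives 12.00, bat_sz = 4 gives 6.00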

One possible approach to make accumulated loss values somewhat comparable for different batch sizes is to multiply the sum of the accumulated averages by the batch size. In the example above, for bat_sz = 2 you’d have 12.00 * 2 = 24.00, and for bat_sz = 4 you’d have 6.00 * 4 = 24.00. This is a normalized sum of averages.

Or you could take another approach and divide the sum of the averages by the number of batches (not the batch size) to get an approximate average loss per item. But this is only an approximation: the average of a set of averages equals the overall average only if all the batches are exactly the same size, and it’s common for the last batch in an epoch to be smaller than the others (when the total number of training items is not evenly divisible by the batch size).
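Continuing the plain-Python sketch above, the two normalization ideas look like this for bat_sz = 2:

bat_sz = 2
n_batches = 4
accum_bat_loss = 12.00  # the accumulated sum of batch averages from above

norm_sum = accum_bat_loss * bat_sz       # 24.00 -- normalized sum of averages
approx_avg = accum_bat_loss / n_batches  # 3.00 -- approximate average loss per item

Here the approximation happens to be exact (24.0 total loss / 8 items = 3.0) because all four batches have the same size.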

OK, that’s complicated.

But the only way for loss values to be completely comparable for different batch sizes is to compute an average loss per item (over all training items). For example, in pseudo-code:

loop each epoch:
  epoch_loss = 0
  loop each batch:
    use inputs to compute outputs
    loss = compare computed outputs to targets (avg)
    un_avg_loss_sum = loss * actual_batch_size
    epoch_loss += un_avg_loss_sum (sum)
  end-loop

  if epoch mod 100 == 0:
    avg_loss_per_item = epoch_loss / tot_number_train_items
    print(avg_loss_per_item)
end-loop
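Here’s one way the pseudo-code might look in actual PyTorch. This is a sketch that reuses the objects from the earlier snippet (net, train_ldr, loss_func, optimizer, max_epochs) and assumes a variable n_train that holds the total number of training items:

for epoch in range(0, max_epochs):
  epoch_loss = 0.0  # sum of un-averaged per-item loss values

  for (batch_idx, batch) in enumerate(train_ldr):
    X = batch['predictors']
    Y = batch['targets']

    optimizer.zero_grad()
    oupt = net(X)
    loss_val = loss_func(oupt, Y)           # average per item
    epoch_loss += loss_val.item() * len(X)  # undo the averaging; len(X) is the actual batch size
    loss_val.backward()
    optimizer.step()

  if epoch % 100 == 0:
    avg_loss_per_item = epoch_loss / n_train  # comparable across batch sizes
    print("epoch = %4d  avg loss per item = %0.4f" % (epoch, avg_loss_per_item))

Using len(X) rather than the configured batch size handles a smaller last batch correctly.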

But these complicated approaches really aren’t useful in most scenarios. In most cases the simple sum of averages approach is good enough to tell if training is working or not.

If you want to compare the overall loss of two different models that were trained using two different batch sizes, then you should write a custom function that iterates through all training data using each trained model and computes loss (without updating any weights of course).
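A sketch of such a function, assuming the Dataset objects serve up items under the 'predictors' and 'targets' keys used earlier:

def overall_loss(model, ds, loss_func):
  # exact average loss per item over an entire Dataset; no weights are updated
  model.eval()
  ldr = T.utils.data.DataLoader(ds, batch_size=1, shuffle=False)
  total_loss = 0.0
  with T.no_grad():  # no gradients needed for evaluation
    for batch in ldr:
      oupt = model(batch['predictors'])
      total_loss += loss_func(oupt, batch['targets']).item()
  return total_loss / len(ds)

Because batch_size = 1 here, each loss_func() return value is the loss for a single item, so the result doesn’t depend on whatever batch size was used during training.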

Very tricky stuff.



Neural networks are all about mathematics. Gambling is all about mathematics too. The James Bond series of movies has featured casino scenes in 11 of the 24 films made to date, as I write this post: “Dr. No” (1962), “Thunderball” (1965), “On Her Majesty’s Secret Service” (1969), “Diamonds Are Forever” (1971), “The Man with the Golden Gun” (1974), “For Your Eyes Only” (1981), “Licence to Kill” (1989), “GoldenEye” (1995), “The World Is Not Enough” (1999), “Casino Royale” (2006), and “Skyfall” (2012).


Left: “Dr. No” (1962). Right: “Thunderball” (1965).


Left: “On Her Majesty’s Secret Service” (1969). Right: “Diamonds Are Forever” (1971).


Left: “The Man with the Golden Gun” (1974). Right: “For Your Eyes Only” (1981).


Left: “Licence to Kill” (1989). Right: “GoldenEye” (1995).


Left: “The World Is Not Enough” (1999). Right: “Casino Royale” (2006).


Two scenes from the casino in Macau in “Skyfall” (2012).

