If you’re new to neural networks, you’ll see the terms “log loss” and “cross entropy error” used a lot. Both terms mean the same thing. Having multiple, different terms for the same thing is unfortunately quite common in machine learning (ML). For example, “predictor variable”, “feature”, “X”, and “independent variable” all have roughly the same meaning in ML.

For the rest of this post, I’ll call the idea I’m explaining cross entropy (CE). There are two scenarios. First, a general case, which is more useful to mathematicians. And second, a case specific to ML classification.

In the first, general scenario, CE is used to compare a set of predicted probabilities with a set of actual probabilities. For example, suppose you have a weirdly-shaped, four-sided dice (yes, I know the singular is “die”). Using some sort of physics or intuition you predict that the probabilities for the weird dice are (0.20, 0.40, 0.10, 0.30). Then you toss the dice many thousands of times and determine that the true probabilities are (0.15, 0.35, 0.15, 0.35):

predicted: (0.20, 0.40, 0.10, 0.30)

actual: (0.15, 0.35, 0.15, 0.35)

Cross entropy gives you a measure of your prediction error. CE is minus the sum of the log of predicted, times actual. If p is predicted probability and a is actual probability then, summing over every possible outcome i:

CE = -Σ ln(p_i) * a_i

So your CE is:

-( ln(0.20) * 0.15 + ln(0.40) * 0.35 + ln(0.10) * 0.15 + ln(0.30) * 0.35 ) = 1.33
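
The calculation above can be checked with a few lines of Python. This is only a minimal sketch: the function name `cross_entropy` and the list-based inputs are my own choices for illustration, and it assumes every predicted probability is greater than zero.

```python
import math

def cross_entropy(predicted, actual):
    """Return minus the sum of ln(predicted) * actual over all outcomes.

    Assumes both sequences have the same length and that every
    predicted probability is strictly greater than 0.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return -sum(a * math.log(p) for p, a in zip(predicted, actual))

# The weird four-sided dice example from the text.
predicted = [0.20, 0.40, 0.10, 0.30]
actual = [0.15, 0.35, 0.15, 0.35]
print(round(cross_entropy(predicted, actual), 2))  # 1.33
```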

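Somewhat unusually, the CE for a perfect prediction is not 0 as you’d expect. For example, if your four predictions are (0.25, 0.25, 0.25, 0.25) and the four actuals are also (0.25, 0.25, 0.25, 0.25) then the CE is 1.39, which is, not coincidentally, ln(4). In general, when the predictions match the actual probabilities exactly, the CE equals the entropy of the actual probabilities, and that is the smallest value the CE can take for those actuals.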
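
To see where the ln(4) comes from, here is a small sketch that computes the entropy of the actual probabilities, which is what CE reduces to when predicted and actual are identical. The helper name `entropy` is hypothetical and just for illustration.

```python
import math

def entropy(probs):
    """Return -sum(a * ln(a)), skipping zero entries (0 * ln(0) is taken as 0)."""
    return -sum(a * math.log(a) for a in probs if a > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]
print(round(entropy(uniform), 2))                   # 1.39
print(math.isclose(entropy(uniform), math.log(4)))  # True
```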

Now in the case of ML classification, the predicted probabilities are values that sum to 1.0 but the “actual” probabilities all have the form of one 1.0 value and the rest 0.0 values. For example, suppose you are trying to predict the political party affiliation of a person and there are four possible values: democrat, republican, libertarian, other. These values would be 1-of-N encoded as democrat = (1,0,0,0), republican = (0,1,0,0), libertarian = (0,0,1,0), other = (0,0,0,1).

And suppose a neural network classifier emits a prediction of (0.50, 0.10, 0.10, 0.30) when the actual party is democrat. The cross entropy error for the prediction is:

-( ln(0.50) * 1 + ln(0.10) * 0 + ln(0.10) * 0 + ln(0.30) * 0 ) = 0.69

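Notice that because of the 1-of-N encoding, most of the terms are multiplied by zero, so every term in CE drops out except the one for the actual class. In other words, the CE is just minus the log of the probability that was predicted for the correct class.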
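
Because only one term survives, the classification case can be coded without looping over all the classes. The sketch below assumes the actual class is given as an index into the prediction list; the function name `cross_entropy_one_hot` and the `parties` list are hypothetical names used only for this illustration.

```python
import math

def cross_entropy_one_hot(predicted, target_index):
    """CE when the actual values are 1-of-N encoded.

    Only the probability predicted for the actual class matters.
    Assumes that probability is strictly greater than 0.
    """
    return -math.log(predicted[target_index])

parties = ["democrat", "republican", "libertarian", "other"]
predicted = [0.50, 0.10, 0.10, 0.30]

ce = cross_entropy_one_hot(predicted, parties.index("democrat"))
print(round(ce, 4))  # 0.6931
```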

For this type of problem scenario, a perfect prediction does nicely give a cross-entropy error of 0 because ln(1.0) = 0.

As a final note, when coding cross entropy error, you have to be careful not to compute ln(0.0), which tends to negative infinity and will either raise an error or produce an infinite result, depending on the library you use.
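
One common way to guard against this is to skip terms where the actual value is 0 and to floor each predicted probability at a tiny positive value before taking the log. The sketch below is one possible approach, not the only one: the name `safe_cross_entropy` and the floor value of 1e-12 are assumptions I picked for illustration. Note that Python's `math.log(0.0)` raises a `ValueError` rather than returning negative infinity.

```python
import math

EPSILON = 1e-12  # assumed floor for predicted probabilities (arbitrary choice)

def safe_cross_entropy(predicted, actual, eps=EPSILON):
    """Cross entropy that never evaluates ln(0)."""
    total = 0.0
    for p, a in zip(predicted, actual):
        if a == 0.0:
            continue  # a zero actual contributes nothing, so skip the log entirely
        total += a * math.log(max(p, eps))  # floor p so the log stays finite
    return -total

# A confidently wrong prediction gives a large but finite error.
print(round(safe_cross_entropy([0.0, 1.0, 0.0, 0.0], [1, 0, 0, 0]), 2))  # 27.63
```

The price of the floor is that a prediction of exactly 0.0 for the correct class is reported as a large finite error instead of an infinite one, which is usually what you want during training.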