Log Loss and Cross Entropy are Almost the Same

Log loss and cross entropy are measures of error used in machine learning. The underlying math is the same. Log loss is usually used when there are just two possible outcomes, encoded as 0 and 1. Cross entropy is usually used when there are three or more possible outcomes.

Suppose you are looking at a really weirdly shaped spinner that can come up “red”, “blue”, or “green”. You do some sort of physics analysis and predict that the probabilities of (red, blue, green) are (0.1, 0.5, 0.4). Then you spin the spinner many thousands of times to determine the true probabilities and get (0.3, 0.2, 0.5). To measure how good your prediction is, you can use cross entropy error, which is best explained by example:

CE = -( ln(0.1)(0.3) + ln(0.5)(0.2) + ln(0.4)(0.5) )
   = -( (-2.3)(0.3) + (-0.69)(0.2) + (-0.92)(0.5) )
   = -( -0.69 + -0.14 + -0.46 )
   = 1.29

In words, cross entropy is the negative sum of the products of the logs of the predicted probabilities and the corresponding actual probabilities. Smaller values indicate a better prediction.
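
If you want to double-check the arithmetic, here is a minimal Python sketch (the cross_entropy function name is just mine, for illustration) that reproduces the spinner example:

import math

def cross_entropy(predicted, actual):
    # negative sum of actual[i] * ln(predicted[i])
    return -sum(a * math.log(p) for (p, a) in zip(predicted, actual))

predicted = [0.1, 0.5, 0.4]  # probabilities from the physics analysis
actual = [0.3, 0.2, 0.5]     # probabilities from the many spins
print(cross_entropy(predicted, actual))  # about 1.29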

But scenarios where both the predicted values and the actual values are probability distributions are rare. Most often the actual values are discrete and are one-hot encoded. For example, suppose you are trying to predict the political leaning (conservative, moderate, liberal) of a person based on their age, annual income, and so on. You encode (conservative, moderate, liberal) as [(1,0,0) (0,1,0) (0,0,1)]. Now suppose that for a particular age, income, etc. your prediction is (0.3, 0.6, 0.1) = moderate, because the middle value is largest, and the actual political leaning is in fact moderate, so the actual values are (0, 1, 0).

Using cross entropy, the error is:

CE = -( ln(0.3)(0) + ln(0.6)(1) + ln(0.1)(0) )
   = -( 0 + (-0.51)(1) + 0 )
   = 0.51

Notice that in this common scenario, because the one-hot encoding has a single 1 and the rest 0s, only one term contributes to the cross entropy.
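
Here is the same kind of Python sketch for the one-hot case (again, just illustrative code, not any particular library). It shows that the full sum and the single surviving term give the same answer:

import math

predicted = [0.3, 0.6, 0.1]  # predicted probabilities
actual = [0, 1, 0]           # one-hot encoding of the true class, moderate

ce_full = -sum(a * math.log(p) for (p, a) in zip(predicted, actual))
ce_short = -math.log(predicted[actual.index(1)])  # only the true-class term survives
print(ce_full, ce_short)  # both about 0.51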

Now log loss is the same as cross entropy, but in my opinion, the term log loss is best used when there are only two possible outcomes. This simultaneously simplifies and complicates things.

For example, suppose you’re trying to predict (male, female) from things like annual income, years of education, etc. If you encode male = 0 and female = 1 (notice you only need one number for a binary prediction), your prediction is 0.3 (male, because it’s less than 0.5), and the actual result is female, the log loss is:

LL = -( ln(0.3) )
   = 1.20

In words, for log loss with a binary prediction, you just take the negative log of your predicted probability of the true result (here, 0.3 for female). This is the same as cross entropy.
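
Here is a Python sketch of the binary case. The log_loss helper is hypothetical, but it follows the rule above, where the single prediction value is the predicted probability of the class encoded as 1 (female):

import math

def log_loss(p_of_1, actual_is_1):
    # negative log of the predicted probability of the true result
    p_true = p_of_1 if actual_is_1 else 1.0 - p_of_1
    return -math.log(p_true)

print(log_loss(0.3, True))   # actual is female: -ln(0.3) = about 1.20
print(log_loss(0.3, False))  # actual is male: -ln(0.7) = about 0.36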

So, pretty simple in principle. However, understanding all the things related to log loss and cross entropy is difficult because there are so many interrelated details, and because vocabulary is used inconsistently.

Note: Thanks to Steve Sadler for pointing out typos in the intermediate part of my calculations. Now fixed.
