Neural Network Classification, Categorical Data, Softmax Activation, and Cross Entropy Error

When using neural networks for classification, there is a close relationship between how the categorical data is encoded, the softmax activation function, and the cross entropy error function. Suppose you want to use a neural network to classify data that looks like this:

5.0 7.0 2.0 3.0 red

9.0 8.0 4.0 5.0 blue

2.0 2.0 4.0 7.0 red

6.0 5.0 2.0 8.0 green

8.0 7.0 8.0 7.0 blue

. . .

There are four numeric inputs and one categorical value (red, green, or blue) to predict. In this situation it’s recommended that you use 1-of-N encoding for the categorical data, like so:

5.0 7.0 2.0 3.0  1  0  0

9.0 8.0 4.0 5.0  0  0  1

2.0 2.0 4.0 7.0  1  0  0

6.0 5.0 2.0 8.0  0  1  0

8.0 7.0 8.0 7.0  0  0  1

. . .
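Just to make the encoding concrete, here is a minimal C# sketch that maps a color label to its 1-of-N vector; the method name and the label ordering are illustrative assumptions, not part of any particular library:

using System;

class EncodeDemo
{
  static double[] EncodeOneOfN(string label, string[] categories)
  {
    double[] result = new double[categories.Length];  // starts as all 0.0
    int idx = Array.IndexOf(categories, label);  // position of the label
    if (idx < 0) throw new ArgumentException("Unknown label: " + label);
    result[idx] = 1.0;  // single 1.0 in the slot for this category
    return result;
  }

  static void Main()
  {
    string[] categories = { "red", "green", "blue" };
    double[] encoded = EncodeOneOfN("green", categories);
    Console.WriteLine(string.Join(" ", encoded));  // prints 0 1 0
  }
}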

There’s a long explanation behind why, but just assume that this encoding is a good thing. By the way, if there are only two categorical values being predicted you’d use 1-of-(N-1) encoding (a single 0 or 1 value), and categorical input values should be encoded with 1-of-(N-1) encoding as well. With this output encoding, you want the neural network output layer to have three neurons.

To compare a computed output against a target like (0 1 0), instead of applying a normal sigmoid or step function to the output layer, you want the output to be three values that are each between 0.0 and 1.0 and that sum to 1.0. Why? Suppose an output is (1.0 1.0 1.0); it’s not at all clear how to evaluate that against a target of (1 0 0). To get output values that sum to 1.0 you can use the softmax function: allow arbitrary raw output values, compute the sum of their Math.Exp() values, and then divide each Math.Exp() value by that sum. For example, if some raw output from the neural network is (2.0 -3.0 0.0), the Math.Exp() values are (7.39 0.05 1.00), which sum to 8.44, and dividing each value by that sum gives a softmax activation output of (0.87 0.01 0.12). By the way, this computation is tricky in practice because you have to guard against numeric overflow.
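Here is a small C# sketch of that softmax computation, with the usual overflow guard of subtracting the largest raw value before calling Math.Exp() (the subtraction cancels out when you divide by the sum); the method name is just for illustration:

using System;

class SoftmaxDemo
{
  static double[] Softmax(double[] rawOutputs)
  {
    // Find the largest raw value; subtracting it before Math.Exp()
    // prevents overflow and does not change the final result.
    double max = rawOutputs[0];
    for (int i = 1; i < rawOutputs.Length; ++i)
      if (rawOutputs[i] > max) max = rawOutputs[i];

    double[] result = new double[rawOutputs.Length];
    double sum = 0.0;
    for (int i = 0; i < rawOutputs.Length; ++i)
    {
      result[i] = Math.Exp(rawOutputs[i] - max);
      sum += result[i];
    }
    for (int i = 0; i < result.Length; ++i)
      result[i] /= sum;  // each value is now in (0, 1) and the values sum to 1.0
    return result;
  }

  static void Main()
  {
    double[] raw = { 2.0, -3.0, 0.0 };  // the raw outputs from the example above
    double[] sm = Softmax(raw);
    Console.WriteLine("{0:F4} {1:F4} {2:F4}", sm[0], sm[1], sm[2]);  // 0.8756 0.0059 0.1185
  }
}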

You can think of the softmax outputs as probabilities. But comparing a softmax output with a training target becomes somewhat of a problem if you use a standard sum of squared deviations (SSD) approach. For example, suppose the softmax output is (0.87 0.01 0.12) as above and the training target is (1 0 0). The SSD would be computed as (1 – 0.87)^2 + (0 – 0.01)^2 + (0 – 0.12)^2. You’d compute these values for every training vector and sum them to get the total SSD error. Again, there is a long story here, but the net conclusion is that it’s recommended to use cross entropy error instead of sum of squared deviations error. Cross entropy error, in principle, looks like this:

(1 * Ln(0.87)) + (0 * Ln(0.01)) + (0 * Ln(0.12))

= -0.14 + 0 + 0 = -0.14

You would add up the cross entropy values for all the training vectors and then multiply by -1. There’s a lot to understand here, but in essence only the 1 term in the training target matters (multiplying by the 0s contributes nothing to the sum). If the output value that corresponds to the 1 is close to 1.0, its Ln is close to zero, so the error is small. To summarize, when using a neural network to classify categorical data, encode the output categorical data using 1-of-N encoding (except when there are only two categories), use the softmax activation function to generate output, and use cross entropy to measure error.
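To tie the pieces together, here is a short C# sketch of total cross entropy error computed as described above; the jagged-array layout and the method name are assumptions made for the example:

using System;

class CrossEntropyDemo
{
  // Sum target * Ln(output) across every output neuron of every training
  // vector, then multiply by -1 so that smaller values mean less error.
  static double CrossEntropyError(double[][] targets, double[][] outputs)
  {
    double sum = 0.0;
    for (int i = 0; i < targets.Length; ++i)
      for (int j = 0; j < targets[i].Length; ++j)
        sum += targets[i][j] * Math.Log(outputs[i][j]);  // only the 1 terms contribute
    return -1.0 * sum;
  }

  static void Main()
  {
    double[][] targets = { new double[] { 1, 0, 0 } };           // training target
    double[][] outputs = { new double[] { 0.87, 0.01, 0.12 } };  // softmax output
    Console.WriteLine(CrossEntropyError(targets, outputs).ToString("F2"));  // 0.14
  }
}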
