I was presenting a talk about the Microsoft CNTK neural network library recently. CNTK has a quirk that isn’t explained anywhere in the documentation. Briefly, unless you’re careful, when you train a CNTK neural network classifier, you will apply softmax activation twice. Training will still work, but it will be much slower than it should be, which means that for a given amount of training your resulting model may not be as accurate as it should be.
Consider this (incorrect) CNTK code snippet:
    h_layer = C.layers.Dense(hidden_dim, activation=C.ops.tanh,
        name='hidden_layer')(X)
    o_layer = C.layers.Dense(output_dim, activation=C.ops.softmax,
        name='out_layer')(h_layer)
    model = o_layer
    . . .
    tr_loss = C.cross_entropy_with_softmax(model, Y)
    . . .
    (train model)
    . . .
    (use model to make predictions)
In many neural network libraries, you apply softmax activation to the output layer to coerce the raw output values so that they sum to 1.0 and can be interpreted as probabilities. But CNTK doesn’t have a basic cross entropy loss function; it only has a cross entropy with softmax loss function, which applies softmax itself before computing the error. This means that in the snippet above softmax is applied twice: once by the output layer and once by the loss function.
Why is this a problem? In the image below, imagine you have a neural network classifier where one training item has a correct output of (0, 1, 0). The computed output values are (1.5, 3.5, 2.5), and if softmax is applied the output values become (0.0900, 0.6652, 0.2447). If you stopped here and used regular cross entropy loss, you would have nice separation between the output values.
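The arithmetic in this example is easy to check by hand. Here is a minimal sketch in plain Python, independent of CNTK. The helper names softmax and cross_entropy are illustrative only and are not CNTK functions.

```python
import math

def softmax(values):
    """Scale a list of raw values so they are positive and sum to 1.0."""
    biggest = max(values)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - biggest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Regular cross entropy for a one-hot target (no softmax inside)."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

raw = [1.5, 3.5, 2.5]   # computed output values from the example
target = [0, 1, 0]      # correct output

once = softmax(raw)
loss_once = cross_entropy(once, target)
print(["%0.4f" % p for p in once])   # ['0.0900', '0.6652', '0.2447']
print("%0.4f" % loss_once)           # 0.4076
```

With softmax applied once, the correct class gets about two-thirds of the probability mass and the cross entropy error is about 0.41.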
But if you apply softmax a second time, the output values become (0.2535, 0.4506, 0.2959). The values still sum to 1.0 and the middle value is still the largest, but the three output values are now much closer to each other. With less separation, the error signal that drives the weight updates is weaker, so the training algorithm will improve more slowly.
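To see why less separation slows training, it helps to look at the gradient that reaches the raw output values. The sketch below uses plain Python rather than CNTK. The softmax-twice gradient is derived by hand with the chain rule for this illustration; it is not a CNTK call, and all names are hypothetical.

```python
import math

def softmax(values):
    """Scale a list of raw values so they are positive and sum to 1.0."""
    biggest = max(values)
    exps = [math.exp(v - biggest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

raw = [1.5, 3.5, 2.5]
target = [0, 1, 0]

once = softmax(raw)     # softmax applied by the loss function only
twice = softmax(once)   # softmax in the output layer AND in the loss function

# Cross entropy error seen by the training algorithm in each case.
loss_once = -math.log(once[1])
loss_twice = -math.log(twice[1])

# Gradient of the error with respect to the raw output values.
# Softmax once: the standard result (probability - target).
grad_once = [p - t for p, t in zip(once, target)]

# Softmax twice: chain rule through the extra softmax, whose Jacobian
# is J[i][k] = once[i] * ((i == k) - once[k]).
err = [q - t for q, t in zip(twice, target)]
weighted = sum(p * e for p, e in zip(once, err))
grad_twice = [p * (e - weighted) for p, e in zip(once, err)]

# Even a perfect first softmax of (0, 1, 0) cannot push the second
# softmax above e / (e + 2), so the error can never approach zero.
best_twice = softmax([0.0, 1.0, 0.0])
loss_floor = -math.log(best_twice[1])

print(["%0.4f" % q for q in twice])
print("%0.4f %0.4f" % (loss_once, loss_twice))
print(["%0.4f" % g for g in grad_once])
print(["%0.4f" % g for g in grad_twice])
print("%0.4f" % loss_floor)
```

For this training item, every component of the softmax-twice gradient is roughly half the size of the softmax-once gradient, so each update moves the weights less. The last value shows a second effect: with three classes, the error computed after a double softmax has a floor of about 0.55 no matter how good the network gets.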
So, the correct way to implement a CNTK classifier is:
    h_layer = C.layers.Dense(hidden_dim, activation=C.ops.tanh,
        name='hidden_layer')(X)
    o_layer = C.layers.Dense(output_dim, activation=None,
        name='out_layer')(h_layer)
    nnet = o_layer
    model = C.ops.softmax(nnet)
    . . .
    tr_loss = C.cross_entropy_with_softmax(nnet, Y)
    . . .
    (train nnet)
    . . .
    (use model to make predictions)
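The idea is to train an un-softmaxed nnet object using cross entropy with softmax, so softmax is applied only once. Then, after training, you use the parallel model object, which does have softmax. Because model is built on top of nnet, it shares the trained weights, so there is nothing extra to train.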
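The split between the two objects can be illustrated without CNTK. In the plain-Python sketch below, cross_entropy_with_softmax is a hypothetical stand-in for what a loss of that kind computes (it is not CNTK's implementation), nnet_out plays the role of the raw nnet output, and model_out plays the role of the softmaxed model output.

```python
import math

def softmax(values):
    """Scale a list of raw values so they are positive and sum to 1.0."""
    biggest = max(values)
    exps = [math.exp(v - biggest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_with_softmax(raw_values, target):
    """Stand-in loss: applies softmax internally, then cross entropy."""
    biggest = max(raw_values)
    log_sum = biggest + math.log(sum(math.exp(v - biggest) for v in raw_values))
    # log(softmax(raw)[i]) == raw[i] - log_sum (log-sum-exp form)
    return -sum(t * (v - log_sum) for v, t in zip(raw_values, target))

nnet_out = [1.5, 3.5, 2.5]      # raw output values: what the loss should see
model_out = softmax(nnet_out)   # probabilities: what you report when predicting
target = [0, 1, 0]

# The training loss on the raw values equals regular cross entropy on the
# softmax-once probabilities, so softmax is applied exactly once.
train_loss = cross_entropy_with_softmax(nnet_out, target)
plain_loss = -math.log(model_out[1])

# Softmax preserves order, so both objects agree on the predicted class.
pred_from_nnet = nnet_out.index(max(nnet_out))
pred_from_model = model_out.index(max(model_out))

print("%0.4f %0.4f" % (train_loss, plain_loss))
print(pred_from_nnet, pred_from_model)
```

Both objects pick the same class, so the softmaxed model matters mainly when you want the outputs as probabilities rather than just the predicted class.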