Neural network dropout is a technique that can be applied during training to reduce the chance of model overfitting, the situation where you train too well: the trained model predicts your training data nearly perfectly, but when presented with new data the model classifies poorly.
In dropout training, as each training item is presented, a random 50% of the hidden nodes are “dropped” — you ignore them. By doing this, you’re effectively taking many half-size networks, training them, and then averaging them.
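The random selection of dropped nodes can be sketched in a few lines. This is a minimal NumPy example, not the demo program's actual code; the 0/1 "dropped" convention matches the dNodes array used in the code below, but the variable names here are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 8  # number of hidden nodes

# pick a random 50% of the hidden node indices to drop for this training item
drop_idx = rng.choice(n_hidden, size=n_hidden // 2, replace=False)

# dNodes-style mask: 1 means "dropped for this item", 0 means "keep"
d_nodes = np.zeros(n_hidden, dtype=int)
d_nodes[drop_idx] = 1

print(d_nodes.sum())  # → 4 (exactly half the nodes are dropped)
```

A fresh mask is generated for every training item, so over many items each hidden node is dropped roughly half the time.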
While I was preparing to give a talk about neural network dropout, I created a demo program from scratch. That’s the way I make sure I understand a topic. My approach was to modify normal, non-dropout code so that when it was time to process a dropped hidden node, I’d skip over it. For example, in the forward-pass computation of the hidden nodes, the code looks like:
for j in range(self.nh):
  if dropOut == True and self.dNodes[j] == 1:
    continue  # drop!
  for i in range(self.ni):
    hSums[j] += self.iNodes[i] * self.ihWeights[i,j]
  hSums[j] += self.hBiases[j]  # add the bias
  self.hNodes[j] = self.hypertan(hSums[j])
And I’d perform the same kind of checks in the back-propagation code, for example:
# compute hidden-to-output gradients
for j in range(self.nh):
  if dropOut == True and self.dNodes[j] == 1:
    continue  # drop!
  for k in range(self.no):
    hoGrads[j,k] = oSignals[k] * self.hNodes[j]
OK. But then I was reviewing some documentation for NN code libraries that can perform dropout, and also some online articles about dropout. Several times I read something along the lines of, “dropout sets a random 50% of hidden nodes to 0.0 values.”
Hmm, that seemed strange to me at first, but then I convinced myself that if the value of a hidden node is 0, it’s exactly the same as if the node isn’t there. Surely the documentation couldn’t be wrong?
Wait — not so fast. To test the set-to-zero approach I used a crude but effective technique: I coded up an implementation and compared my ignore-hidden-nodes version with the simpler set-hidden-nodes-to-zero approach. The two approaches produced different results. And the set-to-zero approach gave much worse results.
After examining the code carefully, I realized the problem with the set-to-zero approach occurs when the derivative of the hidden node activation function is computed. If you use logistic sigmoid activation, the derivative, expressed in terms of the node value, is (hid[j]) * (1 - hid[j]). If hid[j] = 0, that’s 0 * 1 = 0, so the gradient is zero, the weight delta is zero, and you’re OK.
But, if you use tanh activation, the derivative is (1 - hid[j]) * (1 + hid[j]), which for hid[j] = 0 is 1 * 1 = 1, so the gradient won’t be zero, the weight delta won’t be zero, and you’re not OK. Some other activation functions would give non-zero gradients and deltas too.
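The difference between the two derivatives at a zeroed node can be checked in a couple of lines:

```python
hid = 0.0  # a dropped hidden node whose value has been set to 0.0

# logistic sigmoid derivative, written in terms of the node's output value
sig_deriv = hid * (1.0 - hid)           # 0 * 1 = 0, gradient is killed

# tanh derivative, also in terms of the node's output value
tanh_deriv = (1.0 - hid) * (1.0 + hid)  # 1 * 1 = 1, gradient survives

print(sig_deriv, tanh_deriv)  # → 0.0 1.0
```

So with sigmoid the zero node value happens to kill the gradient for free, but with tanh the set-to-zero shortcut silently keeps updating weights attached to a node that was supposed to be dropped.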
It is possible to use a set-hidden-node-to-zero approach when implementing dropout, but you’d have to take care of some additional details. The moral: in machine learning, everything is tricky.
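One of those additional details, if you wanted to make set-to-zero work with tanh, would be to explicitly zero the back-propagation signals of the dropped nodes so the non-zero derivative can’t produce a non-zero weight delta. This is my own sketch with hypothetical names, not code from any particular library:

```python
import numpy as np

def hidden_signals(h_nodes, sum_terms, d_nodes):
    # tanh derivative expressed in terms of each node's output value
    deriv = (1.0 - h_nodes) * (1.0 + h_nodes)
    h_signals = deriv * sum_terms
    # extra detail for the set-to-zero approach: force dropped nodes'
    # signals to zero so their incoming weight deltas are zero too
    h_signals[d_nodes == 1] = 0.0
    return h_signals

h = np.array([0.0, 0.5])   # first node dropped (value forced to 0.0)
s = np.array([0.3, 0.3])   # hypothetical summed error terms
d = np.array([1, 0])       # 1 = dropped
print(hidden_signals(h, s, d))  # → [0.    0.225]
```

With tanh, skipping this step is exactly the bug described above: the dropped node’s derivative is 1, so its weights keep getting updated.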