Neural network dropout is a technique that can be applied to NN training to reduce the chances of model overfitting — you train too well and so the trained model predicts your training data nearly perfectly, but when presented with new data the model classifies poorly.

In dropout training, as each training item is presented, a random 50% of the hidden nodes are “dropped” — you ignore them. By doing this, you’re effectively taking many half-size networks, training them, and then averaging them.
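The drop-selection step can be sketched like this. This is a hypothetical illustration, not code from my demo; make_drop_nodes and the 0/1 flag array (in the style of the demo's dNodes) are my names:

```python
import numpy as np

rnd = np.random.RandomState(0)  # seeded so the selection is repeatable

def make_drop_nodes(num_hidden):
    # mark a random 50% of the hidden node indices as dropped (1 = dropped)
    dropped = rnd.choice(num_hidden, size=num_hidden // 2, replace=False)
    flags = np.zeros(num_hidden, dtype=int)
    flags[dropped] = 1
    return flags

flags = make_drop_nodes(8)
print(flags.sum())  # exactly half the nodes (4 of 8) are dropped
```

A fresh set of flags is generated for each training item, so a different half-size network is trained on each presentation.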

While I was preparing to give a talk about neural network dropout, I created a demo program from scratch. That's the way I make sure I understand a topic. My approach was to modify normal, non-dropout code so that when it was time to process a dropped hidden node, I'd skip over it. For example, the forward-pass computation of the hidden nodes looks like:

for j in range(self.nh):
  if dropOut == True and self.dNodes[j] == 1:
    continue  # drop!
  for i in range(self.ni):
    hSums[j] += self.iNodes[i] * self.ihWeights[i,j]
  hSums[j] += self.hBiases[j]  # add the bias
  self.hNodes[j] = self.hypertan(hSums[j])

And I’d perform the same kind of checks in the back-propagation code, for example:

# compute hidden-to-output gradients
for j in range(self.nh):
  if dropOut == True and self.dNodes[j] == 1:
    continue  # drop!
  for k in range(self.no):
    hoGrads[j,k] = oSignals[k] * self.hNodes[j]

OK. But then I was reviewing some documentation for NN code libraries that can perform dropout, and also some online articles about dropout. Several times I read something along the lines of, “dropout sets a random 50% of hidden nodes to 0.0 values.”

Hmm, that seemed strange to me at first, but then I convinced myself that if the value of a hidden node is 0, it's exactly the same as if the node isn't there. Surely the documentation couldn't be wrong?

Wait — not so fast. To test the set-to-zero approach I used a crude but effective technique: I coded up an implementation and compared my ignore-hidden-nodes version with the simpler set-hidden-nodes-to-zero approach. The two approaches produced different results. And the set-to-zero approach gave much worse results.

After examining the code carefully, I realized the problem with the set-to-zero approach arises when the derivative of the hidden node activation function is computed. If you use logistic sigmoid, the derivative is (hid[j]) * (1 - hid[j]), which, if hid[j] = 0, is 0 * 1 = 0. So the gradient is zero, the weight delta is zero, and you're OK.

But if you use tanh activation, the derivative is (1 - hid[j]) * (1 + hid[j]), which, if hid[j] = 0, is 1 * 1 = 1. So the gradient won't be zero, the weight delta won't be zero, and you're not OK. Some other activation functions would give non-zero gradients and deltas too.
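To see the difference concretely, here is a small sketch (the helper names are mine, chosen for illustration) comparing the two derivatives at a node value of zero:

```python
# derivative of the logistic sigmoid, written in terms of the node's
# output value h: h * (1 - h)
def log_sig_deriv(h):
    return h * (1.0 - h)

# derivative of tanh, written in terms of the node's output value h:
# (1 - h) * (1 + h)
def tanh_deriv(h):
    return (1.0 - h) * (1.0 + h)

print(log_sig_deriv(0.0))  # 0.0: the zeroed node's gradient vanishes
print(tanh_deriv(0.0))     # 1.0: the zeroed node still gets weight updates
```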

It is possible to use a set-hidden-node-to-zero approach when implementing dropout, but you’d have to take care of some additional details. The moral: in machine learning, everything is tricky.
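One such detail is forcing the dropped node's back-propagated error signal to zero as well, not just its output value. A minimal sketch, assuming tanh activation; hidden_signal and the dropped flag are my illustrative names, not code from the demo:

```python
def hidden_signal(h, downstream_sum, dropped):
    # error signal for one tanh hidden node during back-propagation;
    # forcing a dropped node's signal to zero makes its incoming weight
    # deltas zero, matching the skip-the-node approach
    if dropped:
        return 0.0
    return downstream_sum * (1.0 - h) * (1.0 + h)

print(hidden_signal(0.0, 0.7, dropped=True))   # 0.0: no weight update
print(hidden_signal(0.5, 0.7, dropped=False))  # normal, non-zero update
```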

If one uses dropout, you get less signal, so one might have to double the signal and correct the output elsewhere. However, I have great doubts about dropout optimization itself.

Essentially, it's not addressing the root cause.

Over-fitting is a result of using too many nodes, so the NN starts to act as a memory.

That is not possible if the NN has the right amount of nodes to begin with. So finding the correct amount of nodes is key to the answer.

Sure, a dropout network will work, but with overhead in computation time (it requires too many nodes). Maybe it's nice for deep networks, but for smaller networks (a single hidden layer) you don't get a big gain from it. It would be much easier to extend your article about sweep optimization, so you can sweep-test through NN configuration layouts, e.g.: hidden-layer-node-count[ ] = {9,8,7,6,5,4,3}

That would find the network with the least amount of nodes and the validity score one likes.
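That sweep idea could be sketched like this. The names sweep_hidden_sizes and train_and_score are hypothetical, and the toy scorer only stands in for real training and scoring:

```python
def sweep_hidden_sizes(train_and_score, sizes, min_accuracy):
    # try candidate hidden-layer sizes, smallest first, and return the
    # smallest network whose validity score reaches the threshold
    for nh in sorted(sizes):
        acc = train_and_score(nh)
        if acc >= min_accuracy:
            return (nh, acc)
    return None  # no candidate size was good enough

# toy stand-in scorer: pretend accuracy grows with node count
fake_score = lambda nh: 0.5 + 0.05 * nh
best = sweep_hidden_sizes(fake_score, [9, 8, 7, 6, 5, 4, 3], 0.80)
print(best[0])  # smallest size that reaches the threshold
```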

I think dropout and its random 'node killing' merely show how well NNs, by definition, intend to learn and improve or heal themselves. A bit like your article on weight vs. signal optimization: one can alter a part of the network, but the dualistic math behind the NN still 'heals' from it.

Perhaps dropout's strength is only for deep networks, as a way of adding noise in deep learning training methods. A next layer would then learn to solve with less detailed information (e.g., showing half of a cat's face instead of the whole), forcing the network to find the detailed features of a cat's face. Since deep networks are mostly about 2D data inputs, I can imagine that's their usage.

Maybe if one had a deep network (I'm waiting on your article on that), it would be interesting to not completely shut down 50% of the nodes, but rather add 30% of salt (noise) to the signal, so the deep layers can each evolve more complex functions (as compared to pre-learned layers; I think Google is using pre-learned layers in some of their research).