Until quite recently, neural network libraries like TensorFlow and CNTK didn't exist, so if you wanted to create a neural network, you had to write it from scratch in C/C++, C#, Java, or a similar language.
In those days, implementing neural network dropout meant writing code to tag the nodes to be dropped on each training iteration, editing the code that computes output so it skips the dropped nodes, editing the back-propagation training code the same way, and finally scaling the trained weights to account for the fact that dropout was used during training.
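Those steps can be sketched in a few lines of NumPy. This is only an illustration of the general approach described above, not code from any particular old library; the tanh activation, layer sizes, and variable names are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
drop_rate = 0.5

def forward_hidden(x, W, b, drop_mask=None):
    """Compute hidden activations; a mask of zeros skips dropped nodes."""
    h = np.tanh(x @ W + b)
    if drop_mask is not None:        # training: zero out the tagged nodes
        h = h * drop_mask
    return h

# one training iteration: tag nodes to drop, then compute hidden output
x = rng.normal(size=(4,))            # 4 input values
W = rng.normal(size=(4, 6))          # 4-6 input-to-hidden weights
b = np.zeros(6)
mask = (rng.random(6) > drop_rate).astype(np.float64)
h_train = forward_hidden(x, W, b, mask)

# after training: scale the hidden-to-output weights by (1 - drop_rate)
# to account for the fact that dropout was used during training
W_out = rng.normal(size=(6, 3))
W_out_final = W_out * (1.0 - drop_rate)

# at inference time, no mask is applied
h_infer = forward_hidden(x, W, b)
```

The back-propagation edits (zeroing gradients for dropped nodes) follow the same masking idea and are omitted here for brevity.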
The approach I just described was a bit tricky, but not quite as difficult as the description may sound. But still, in the old days (like 2-3 years ago), almost everything about writing neural network code was non-trivial.
So, my point is, I really, really understand dropout because I’ve read the source research papers, and I’ve implemented dropout from scratch many times.
Then in 2015 and 2016, along came TensorFlow and Keras and CNTK and other libraries. The approach used by these libraries is quite simple. Instead of creating a custom network, you place a so-called dropout layer into the network. The dropout layer sets some of its input values to 0.0, which effectively drops the associated nodes in the layer immediately before the dropout layer.
Library code could resemble:
model = Sequential()           # not real code
model.add(Dense(4))            # input
model.add(Dense(6))            # hidden
model.add(Dropout(rate=0.5))   # apply to hidden
model.add(Dense(3))            # output
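What the dropout layer does to its input values can be shown with a tiny NumPy sketch. The 0.5 rate and the 6-node hidden layer mirror the pseudocode above; this is an illustration of the zeroing behavior, not actual library internals.

```python
import numpy as np

rng = np.random.default_rng(1)
rate = 0.5

hidden_values = rng.normal(size=(6,))          # output of the 6-node hidden layer
mask = rng.random(6) > rate                    # True = keep, False = drop
dropped = np.where(mask, hidden_values, 0.0)   # what the dropout layer emits
```

Each kept value passes through unchanged and each dropped value becomes 0.0, so the nodes in the preceding hidden layer are effectively removed for that training iteration.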
The only way I could fully understand this mechanism was to sketch out a few pictures. Notice that if you place a dropout layer immediately after the input layer, you are dropping input values, which is sometimes called jittering (although jittering can also mean adding noise to input values). If you place a dropout layer after the output layer, you're dropping output values, which doesn't make sense in any scenario I've ever seen.
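The two senses of jittering mentioned above can be contrasted in a short sketch. The 4-value input, the 0.5 rate, and the noise scale are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(4,))                  # 4 input values

# dropout applied to the inputs: some input values are zeroed
keep = rng.random(4) > 0.5
x_dropped = np.where(keep, x, 0.0)

# the other sense of jittering: add small noise to every input value
x_noisy = x + rng.normal(scale=0.01, size=4)
```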
I don’t think there’s a moral to this story. But an analogy might be something like this: In the 1920s and 1930s, everyone who drove a car probably had to have pretty good knowledge of how cars worked, so that they could fix the cars when they broke. But as time went on, understanding things like how to adjust the ignition timing became less and less important. Maybe that’s true of deep neural networks.
But it’s still good to know how things work.
To the best of my knowledge, the idea of dropout (but not the term ‘dropout’) was introduced in a 2012 research paper, and the first use of the term ‘dropout’ occurred in a 2014 follow-up paper. Dropout became widely known in late 2015. There are a couple of very deep research papers about the mathematics behind dropout (and how it averages virtual sub-networks). The best explanation for me is in a paper at: https://pdfs.semanticscholar.org/58b5/0c02dd0e688e0d1f630daf9afc1fe585be4c.pdf