There are pros and cons to working at a huge company. One of the best things about working at Microsoft is the research talks that happen every day on “resnet”. I recently gave a resnet talk on the topic of neural network dropout.

I spent quite a bit of time reviewing fundamental neural network concepts: the input-output mechanism and the back-propagation training algorithm. Then I discussed the dropout technique where, as each training item is presented, a random 50% of the hidden nodes are selected and dropped as if they weren’t there.

This technique in effect samples sub-networks and then averages them together. The main idea is very simple but, as always with neural networks, there are many subtle details.
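As a concrete sketch of the mechanism (my own minimal illustration, not code from any particular published implementation), dropout on a single tanh hidden layer amounts to zeroing each hidden node with probability 0.5 during training, and scaling all hidden outputs by 0.5 at prediction time to approximate the average of the sampled sub-networks:

```python
# dropout_sketch.py -- minimal illustration of dropout on one hidden layer
import numpy as np

rng = np.random.default_rng(0)

def hidden_forward(x, W, b, drop=True, p=0.5):
    # compute tanh hidden-layer activations
    h = np.tanh(x @ W + b)
    if drop:
        # training: zero each hidden node independently with probability p
        mask = (rng.random(h.shape) >= p).astype(float)
        return h * mask
    else:
        # prediction: keep all nodes but scale outputs by (1 - p)
        return h * (1.0 - p)

x = rng.random(3)         # 3 arbitrary input values
W = rng.random((3, 4))    # weights for 4 hidden nodes
b = rng.random(4)         # hidden biases

print(hidden_forward(x, W, b, drop=True))   # some activations zeroed
print(hidden_forward(x, W, b, drop=False))  # all activations, scaled by 0.5
```

The train-time mask and the prediction-time scaling are the two halves of the averaging idea: each training pass sees a random sub-network, and prediction approximates the ensemble of all of them.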

Also, when I gave my presentation, I tried to add peripheral information about the history and development of the technique, and a bit about the psychology that’s associated with machine learning research.

I gave the audience a few challenges. When the nodes to drop are selected, they’re always (in every example I’ve ever found anyway) selected randomly:

for-each hidden node
  generate a random p between 0.0 and 1.0
  if p < 0.50 mark the current node as a drop node
end for-each

But this approach doesn’t guarantee that exactly half of the hidden nodes will be selected. If you have four hidden nodes you might get 0, 1, 2, 3, or 4 drop nodes. So the challenge was to write selection code that guarantees exactly half of the nodes are selected.
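To make the point concrete, here is a quick simulation (my own sketch) of the per-node coin-flip approach with four hidden nodes; the number of drop nodes varies from trial to trial:

```python
# coinflip.py -- per-node random selection does not fix the drop count
import random

random.seed(1)  # make the demo reproducible
num_hidden = 4

for trial in range(5):
    # flip a 50% coin for each hidden node, count the drops
    drop_count = sum(1 for _ in range(num_hidden)
                     if random.random() < 0.50)
    print("trial", trial, "drop count =", drop_count)
```

Run this a few times with different seeds and you will see counts anywhere from 0 to 4, rather than a guaranteed 2.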

If using the Python language, one way to do this would be to use the random.sample() function. For example:

# sample.py -- select exactly half of the hidden node indices
import random

print("\nBegin \n")
random.seed(0)  # make the demo reproducible

indices = list(range(10))
print(indices)  # [0, 1, 2, ..., 9]

selected = random.sample(indices, 5)
print(selected)  # exactly 5 distinct random indices

print("\nEnd \n")

I pointed out that, to the best of my knowledge, nobody has investigated and published an analysis of whether the two selection approaches give essentially the same neural network prediction accuracy.


Hmm, would it make a difference to really take out exactly 50% of the nodes, versus dropping each node randomly with a 50% chance? Randomly dropping at 50% over x epochs is about the same on average. But well, it’s statistics here: what’s the chance one would take out all the nodes in 50%-chance mode (at some epoch all are zero)?

It would be like a roulette table coming up red for every hidden node id. I think that would partly erase learned outcomes, with roughly a 1/epochs influence, so it can be a small effect depending on the number of epochs, and also depending on how close the nodes are to being fully trained.

Or thinking probabilistically: what are the odds that with a 50% chance exactly half of the nodes are dropped? You’ll find more often that around a third or two thirds are dropped (like gambling with 2 dice). And over x epochs, how likely is it that a certain node gets x repeated training epochs? The effect might not be large for pre-trained layers, but if one had to train from epoch 0, hmm… Sadly the only thing missing for me to test it out is a deep C# network.
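For reference, the commenter’s odds question has an exact answer from the binomial distribution: with n hidden nodes and a 0.50 drop chance, the probability of dropping exactly n/2 nodes is C(n, n/2) / 2^n, and the probability of dropping every node is 1 / 2^n. A quick check in Python (my addition, not part of the original comment):

```python
# binom.py -- probability of dropping exactly half, or all, of the nodes
from math import comb

for n in (4, 10, 100):
    p_half = comb(n, n // 2) / 2**n  # exactly half dropped
    p_all = 1 / 2**n                 # every node dropped
    print(f"n={n}: P(exactly half) = {p_half:.4f}, "
          f"P(all dropped) = {p_all:.2e}")
```

For n = 4 the chance of dropping exactly half is 6/16 = 0.375, and even for larger layers dropping all nodes in one epoch is vanishingly unlikely.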

Btw, besides a deep network, another interesting NN subject is feedback loops, where the output at layer x is fed back as input to layer y in the next epoch, which is nice for learning time series.