The neural network lottery ticket hypothesis was proposed in a 2019 research paper titled “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks” by J. Frankle and M. Carbin. Their summary of the idea is:
We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the “lottery ticket hypothesis:” dense, randomly-initialized, feed-forward networks contain subnetworks (“winning tickets”) that – when trained in isolation – reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.
Let me summarize the idea in the way that I think about it:
Huge neural networks with many weights are extremely time-consuming to train. It turns out that it's possible to train a huge network, then prune away weights that don't contribute much, and still get a model that predicts well.
The lottery ticket idea has limited usefulness because you start by training a gigantic neural network. Then you prune away some weights. This helps a bit at inference time when the trained model is used to make predictions, but running input through a trained model doesn't usually take much time, so not much is gained. The idea is useful from a theoretical point of view — knowing that huge neural networks can in fact be compressed without sacrificing very much prediction accuracy means that maybe it's possible to find a good compressed neural network before training rather than after training.
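To make the train-then-prune idea concrete, here is a minimal NumPy sketch of magnitude pruning — zeroing out the smallest-magnitude weights of an already-trained weight matrix. The function name and the 50% sparsity level are just for illustration; this is the simple one-shot version, whereas the paper uses iterative pruning over several rounds.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of the weights."""
    flat = np.abs(weights).flatten()
    k = int(len(flat) * sparsity)             # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold            # keep only larger weights
    return weights * mask

# Pretend W is a trained layer's weight matrix; prune half of it.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W_pruned = magnitude_prune(W, 0.5)
```

The surviving weights keep their trained values; only the small ones are zeroed.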
I’ve seen three research ideas for compressing a neural network before training. The first paper is “SNIP: Single-Shot Network Pruning Based on Connection Sensitivity” (2019) by N. Lee, T. Ajanthan, and P. Torr. The idea is to run a batch of training data through the network once, identify the connections whose loss gradients are small in magnitude, then delete the associated weights.
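Here is a toy sketch of the SNIP-style sensitivity score, under the common reading that a connection's sensitivity is the magnitude of (gradient × weight). To keep it self-contained I use a single linear layer with squared-error loss, where the gradient has a closed form; the paper applies the same one-pass scoring to full deep networks at initialization. The function names and `keep_frac` parameter are mine, not the paper's.

```python
import numpy as np

def snip_scores(W, X, y):
    """SNIP-style connection sensitivity |g * w| for a linear model
    y_hat = X @ W with mean squared-error loss (a toy stand-in for
    the deep networks used in the paper)."""
    resid = X @ W - y                 # (n, out)
    grad = X.T @ resid / len(X)       # dL/dW, closed form for this loss
    return np.abs(grad * W)           # elementwise sensitivity

def snip_prune(W, X, y, keep_frac):
    """Keep only the top keep_frac most sensitive connections."""
    scores = snip_scores(W, X, y)
    k = int(W.size * keep_frac)
    thresh = np.sort(scores.flatten())[-k]   # k-th largest score
    return W * (scores >= thresh)

# One pass over a batch of data, then prune before any training.
rng = np.random.default_rng(1)
X = rng.normal(size=(32, 5))
y = rng.normal(size=(32, 3))
W0 = rng.normal(size=(5, 3))          # random initialization
W_kept = snip_prune(W0, X, y, keep_frac=0.4)
```

The key point is that the scoring uses a single batch and no training steps, so the pruned network can then be trained from scratch in its sparse form.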
The second paper is “Picking Winning Tickets Before Training by Preserving Gradient Flow” (2020) by C. Wang, G. Zhang, and R. Grosse. Their idea is basically a refinement of the SNIP paper. The idea is to use second derivatives to estimate the effect of dropping a weight after pruning, rather than before pruning as in the SNIP technique.
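A toy sketch of the second-derivative idea, under my reading of the paper's scoring rule: score each weight by -w · (Hg), where H is the Hessian and g the gradient, and keep the weights whose removal would hurt gradient flow the most. I use linear regression so the Hessian-gradient product has a closed form; the paper computes it on deep networks with Hessian-vector products. All names here are illustrative, not the paper's code.

```python
import numpy as np

def grasp_scores(w, X, y):
    """Gradient-flow scores -w * (H g) for linear regression, where
    the Hessian H = X^T X / n is available in closed form."""
    n = len(X)
    g = X.T @ (X @ w - y) / n      # gradient of the squared loss
    Hg = X.T @ (X @ g) / n         # Hessian-gradient product
    return -w * Hg

def grasp_prune(w, X, y, keep_frac):
    """Keep the weights whose removal would reduce gradient flow most
    (lowest scores), prune the rest."""
    scores = grasp_scores(w, X, y)
    k = int(len(w) * keep_frac)
    keep = np.argsort(scores)[:k]
    mask = np.zeros_like(w)
    mask[keep] = 1.0
    return w * mask

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)
w0 = rng.normal(size=10)
w_pruned = grasp_prune(w0, X, y, keep_frac=0.5)
```

Compared with SNIP's first-order score, the second-order term accounts for how removing one weight changes the gradients flowing through the remaining ones.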
The third paper is “Initialization and Regularization of Factorized Neural Layers” (2021) by M. Khodak, N. Tenenholtz, L. Mackey, and N. Fusi. The idea is to factor each (large) weight matrix into two (smaller) weight matrices using singular value decomposition. The two smaller matrices of weights can be trained more quickly than the single large matrix of weights, but this requires some tricky coding.
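The factorization step itself is straightforward to sketch in NumPy: truncate the SVD at some rank and split the singular values between the two factors. This is just the decomposition, with my own function name and an even singular-value split that I believe matches the spirit of the paper's initialization; the tricky coding the paper needs (training and regularizing the factors jointly inside a deep net) is not shown.

```python
import numpy as np

def factorize_layer(W, rank):
    """Replace one (out x in) weight matrix with two factors
    A (out x rank) and B (rank x in) via truncated SVD, splitting
    the singular values evenly between the factors."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:rank])
    A = U[:, :rank] * root          # columns scaled by sqrt of singular values
    B = root[:, None] * Vt[:rank]   # rows scaled the same way
    return A, B

W = np.random.default_rng(3).normal(size=(6, 4))
A, B = factorize_layer(W, rank=4)   # full rank: A @ B reconstructs W exactly
A2, B2 = factorize_layer(W, rank=2)  # low rank: fewer parameters to train
```

With rank r, the factors hold (out + in) · r weights instead of out · in, which is where the training speedup comes from when r is small.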
I speculate that at some point in the future, quantum computing will become commonplace, and when that happens, the need for compressing huge neural networks will go away. But until quantum computing arrives (and I think it will be later rather than sooner), work on compressing neural networks will continue.
The “lottery ticket hypothesis” phrase is catchy and memorable. But if you think about it carefully, the phrase really doesn’t have much to do with the ideas presented in the research paper. Still, researchers need to market and advertise their work just like anyone else. Here are three examples of product marketing names that didn’t turn out very well. Left: “Terror” brand liquid soap. Center: “Painapple Candy”. Right: “Tastes Like Grandma” jam.