Neural networks can be huge. A neural network with millions or billions of weights and biases ("trainable parameters") can take weeks to train, which costs a lot of money and, through energy consumption, emits a lot of CO2.
For decades, researchers have explored various techniques to reduce the size of neural networks. Many of these size reduction techniques are nearly useless in practice because they start by training a huge network and then, after training, prune away network weights that don't contribute much to the output. A promising technique that reduces the size of a neural network before training is called single-shot network pruning at initialization.
Suppose you have a 3-4-2 neural network. If it is fully connected, it will have (3 * 4) + (4 * 2) = 20 weights. It will also have 4 + 2 = 6 special weights called biases, which I'll ignore to keep the explanation clear. Each weight is just a constant. The values of the weights determine the output of a neural network. The process of finding good values for the weights is called training the model.
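The weight and bias counts follow directly from the layer sizes. Here is a quick sketch of that arithmetic, using the 3-4-2 network from above:

```python
# Layer sizes for the 3-4-2 fully connected network described above.
layer_sizes = [3, 4, 2]

# Each pair of adjacent layers contributes (n_in * n_out) weights.
num_weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Every non-input node has one bias.
num_biases = sum(layer_sizes[1:])

print(num_weights)  # 20
print(num_biases)   # 6
```

The same two lines work for any list of layer sizes, not just 3-4-2.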
During training, each weight has an associated value called a gradient. A gradient is just a number. The gradient for a weight changes on each training iteration. The positive or negative sign of a gradient instructs the training code to either increase or decrease the associated weight in order to reduce network error. The magnitude of a gradient tells the training code how much to adjust the associated weight.
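The standard gradient descent update captures both ideas: the gradient's sign sets the direction of the change, and its magnitude scales the size of the change. A minimal sketch (the learning rate value of 0.1 is arbitrary):

```python
def update_weight(weight, gradient, learning_rate=0.1):
    # Move the weight in the direction opposite the gradient's sign;
    # the gradient's magnitude scales the step size.
    return weight - learning_rate * gradient

# A positive gradient decreases the weight ...
print(update_weight(0.5, 2.0))   # 0.3
# ... and a negative gradient increases it.
print(update_weight(0.5, -2.0))  # 0.7
```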
A reduced neural network has some of its weights removed, but it should still compute nearly the same output as the original, uncompressed network. The single-shot network pruning at initialization technique attempts to identify weights that don't contribute much — before training the network.
The technique is very simple. Before training, the network’s weights are initialized to small random values using one of several techniques. Then all training data is fed to the network and the gradients associated with each weight are computed. Then each gradient is normalized by dividing by the sum of the absolute values of the gradients. Network weights that have small normalized preliminary gradient values won’t change much and so those weights are removed.
For example, suppose there are just eight weights in a fully connected neural network. On the preliminary pass, suppose the eight gradients are (1.2, -2.4, 0.8, 3.6, -1.8, 0.4, 2.8, 1.4). The sum of the absolute values is 1.2 + 2.4 + 0.8 + 3.6 + 1.8 + 0.4 + 2.8 + 1.4 = 14.4. The normalized gradients are (1.2/14.4, 2.4/14.4, . . . 1.4/14.4) = (0.08, 0.17, 0.06, 0.25, 0.13, 0.03, 0.19, 0.10). If you decide to reduce the network size by 50%, you'd drop the weights associated with the four smallest normalized gradient values: (0.03, 0.06, 0.08, 0.10).
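The arithmetic above can be checked with a short script. This is only a sketch of the pruning computation itself, not of computing real gradients from a network — the gradient values are the made-up ones from the example:

```python
# Made-up preliminary gradients for the eight weights in the example.
gradients = [1.2, -2.4, 0.8, 3.6, -1.8, 0.4, 2.8, 1.4]

# Normalize: each weight's score is |gradient| / sum of |gradients|.
total = sum(abs(g) for g in gradients)            # 14.4
normalized = [abs(g) / total for g in gradients]

# Reduce the network size by 50%: drop the weights with the
# four smallest normalized gradient values.
num_to_prune = len(gradients) // 2
pruned = sorted(range(len(normalized)),
                key=lambda i: normalized[i])[:num_to_prune]

# Indices of the dropped weights, i.e. the weights whose
# normalized gradients round to 0.08, 0.06, 0.03, and 0.10.
print(sorted(pruned))  # [0, 2, 5, 7]
```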
The original research paper that introduced the single-shot network pruning at initialization technique is “SNIP: Single-Shot Network Pruning Based on Connection Sensitivity” (2019) by N. Lee, T. Ajanthan, and P. Torr.