Regularization is a standard technique used in neural network training. The most common form is L2 regularization. The idea is to add the sum of the squared weight values (the “2” in “L2”) to the error term during training. This penalty reduces the magnitude of the weights, which in turn reduces the possibility of model over-fitting.
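To make the idea concrete, here is a minimal sketch of an L2-penalized error term. The function name and the lambda value are illustrative, not taken from my demo:

```python
import numpy as np

def l2_error(base_error, weights, lam):
    # total error = base error + lambda * (sum of squared weights)
    penalty = lam * np.sum(weights ** 2)
    return base_error + penalty

weights = np.array([0.5, -1.2, 2.0])
# penalty = 0.01 * (0.25 + 1.44 + 4.00) = 0.0569
print(l2_error(0.30, weights, lam=0.01))  # about 0.3569
```

Because the penalty grows with the squares of the weights, training is pushed toward solutions with smaller weight magnitudes.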
I coded up a demo using the Python language so I could gain a full understanding of L2 regularization. During my preliminary research, I found a lot of confusing and contradictory information on the Internet.
For example, when using L2, according to several sources, the theoretical weight update equation is:

w' = w * (1 - (eta * lambda) / n) - eta * (dE / dw)
Here eta (like script lower case n) is the learning rate, and lambda (like a triangle without the bottom) is a regularization constant, and n is the number of training items. It doesn’t make any sense that the weight penalty should depend on the number of training items.
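The theoretical update can be sketched in a few lines of Python. The variable names and numeric values here are made up for illustration; grad stands for the gradient of the error with respect to the weight:

```python
def l2_update(w, grad, eta, lam, n):
    # w' = w * (1 - eta*lambda/n) - eta * (dE/dw)
    return w * (1.0 - eta * lam / n) - eta * grad

# with n = 100 training items, the penalty term shrinks to almost nothing:
# 0.80 * (1 - 0.05 * 0.01 / 100) - 0.05 * 0.10 = 0.794996
print(l2_update(w=0.80, grad=0.10, eta=0.05, lam=0.01, n=100))
```

Notice that dividing by n makes the decay factor nearly 1 for any realistically sized training set, which is part of why the dependence on n seems odd.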
Several sources say that regularization should not be applied to the bias values. This doesn’t make sense to me either. Biases can grow very large, so why not restrict them too?
And several resources simplify the weight update equation to:

w' = w * d
where d is a decay constant with a value like 0.99. But when I tried this approach with some synthetic data, all the weight values quickly went to 0.0 and training completely stalled.
In my demo, the approach that seemed to work best was a “conditional decay”: weights are decayed using the simple equation, but only when the absolute value of a weight is greater than 1.0 (an arbitrary threshold; perhaps a larger one would work better). And I decayed both weights and biases.
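The conditional-decay idea can be sketched as below. The function name and sample values are illustrative; the 1.0 threshold is the arbitrary one from my demo:

```python
def conditional_decay(values, d=0.99, threshold=1.0):
    # decay a weight (or bias) only when its magnitude exceeds the threshold
    return [v * d if abs(v) > threshold else v for v in values]

# small values pass through unchanged; only -2.0 and 1.5 are decayed
print(conditional_decay([0.5, -2.0, 1.5, 0.9]))
```

Because small weights are left alone, the decay can never drive everything to 0.0, yet large weights are still pulled back toward the threshold.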
The moral is that even though there is a lot of information about neural network L2 regularization available on the Internet, I’m skeptical of a lot of that info.