There is a lot of contradictory information on the Internet about the theory and implementation of L2 regularization for neural networks.
Regularization is a technique designed to counter neural network over-fitting. Over-fitting occurs when you train a neural network too well and it predicts almost perfectly on your training data, but predicts poorly on any data not used for training.
As it turns out, an over-fitted neural network usually has weight values that are large in magnitude (say 538.1234 or -1098.5921) rather than small (say 3.8392 or -2.0944). So regularization attempts to keep weights small.
The two most common forms of regularization are called L1 and L2. In L2 regularization (the more common of the two), you modify the error function you use during training to include an additional term that adds a fraction (usually denoted by the lowercase Greek letter lambda) of the sum of the squared values of the weights. So larger weight values lead to larger error, and therefore the training algorithm favors and generates small weight values.
In the equations below, the error function starts as a plain mean squared error between target (t) and computed output (o) values. Then a weight penalty corresponding to the squared weight values is added. The regular weight update is a small learning rate (like 0.05) times the error gradient, which is the calculus derivative of the error function with respect to the weight.
The calculus derivative of the mean squared error term is a bit difficult to derive because it uses the chain rule, but the derivation is well-known and you can find it in many places on the Internet. Because the derivative of a sum is the sum of the derivatives, to get the derivative of the augmented error function, all you have to do is add the derivative of the “squared” weight penalty term.
If y = x^2 then the derivative is y’ = 2*x. The exponent jumps down in front of the x. Notice that the additional weight penalty has a lambda / 2 term rather than just lambda; the 2 is there only so that it cancels the exponent 2. Therefore the additional term in the gradient is just (lambda * w).
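Written out, the error function and gradient described above look roughly like this. This is a reconstruction from the text, not a copy of the original equations: the symbols eta (learning rate), E_0 (the unpenalized error), and the indices k and i are notational assumptions, and the constant in front of the squared-error term (1/2 here, 1/n for a strict mean) is also assumed. That constant doesn't affect the penalty term.

```latex
E = \underbrace{\frac{1}{2} \sum_{k} (t_k - o_k)^2}_{E_0}
    + \frac{\lambda}{2} \sum_{i} w_i^2

\frac{\partial}{\partial w_i} \left( \frac{\lambda}{2} w_i^2 \right) = \lambda w_i

\frac{\partial E}{\partial w_i} = \frac{\partial E_0}{\partial w_i} + \lambda w_i

w_i \leftarrow w_i - \eta \, \frac{\partial E}{\partial w_i}
```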
Note that with this approach, a typical value for lambda might be something like 0.03, so the penalty adds 3% of the current weight value to the gradient.
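A quick way to convince yourself that the penalty's contribution to the gradient really is lambda * w is to compare it against a finite-difference estimate. The sketch below is only an illustration; the function names are made up for this example and the weight values are the ones used earlier in the post.

```python
def l2_penalty(weights, lam):
    """Weight penalty term: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)


def l2_penalty_grad(weights, lam):
    """Analytic gradient of the penalty: lam * w for each weight."""
    return [lam * w for w in weights]


def numeric_grad(weights, lam, eps=1e-6):
    """Central finite-difference estimate of the penalty gradient."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += eps
        down = list(weights)
        down[i] -= eps
        diff = l2_penalty(up, lam) - l2_penalty(down, lam)
        grads.append(diff / (2 * eps))
    return grads


weights = [3.8392, -2.0944, 538.1234]
lam = 0.03
analytic = l2_penalty_grad(weights, lam)
numeric = numeric_grad(weights, lam)
for w, a, n in zip(weights, analytic, numeric):
    print(f"w = {w:10.4f}  lam*w = {a:9.4f}  finite diff = {n:9.4f}")
```

With lambda = 0.03, each penalty gradient is 3% of the weight, so the large weight gets pushed toward zero much harder than the small ones.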
A straightforward implementation of L2 regularization that follows the math definition just adjusts the calculation of the weight gradients (it’s standard not to apply regularization to bias values, but that’s another topic) and then updates weights as usual. For example, the calculation of the gradients for hidden-to-output nodes, in Python, looks something like:
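Here is a minimal sketch of that calculation in plain Python. It's an illustration under stated assumptions, not the code from any particular library: the names (ho_gradients, o_signals, and so on) are hypothetical, the output node is assumed to be linear, and the gradient is taken as dE/dw with the error written as (output - target), so the update subtracts. If you write the error as (target - output), the signs flip and the update adds.

```python
def ho_gradients(h_outputs, o_signals, ho_weights, lam):
    """Gradients for hidden-to-output weights with an L2 term added.

    h_outputs  : hidden node outputs, length n_hidden
    o_signals  : output-node error signals, length n_output
    ho_weights : n_hidden x n_output nested list of current weights
    lam        : L2 regularization constant (lambda)
    """
    grads = []
    for j, h in enumerate(h_outputs):
        row = []
        for k, signal in enumerate(o_signals):
            plain = signal * h                 # usual back-prop gradient
            penalty = lam * ho_weights[j][k]   # derivative of (lam/2) * w^2
            row.append(plain + penalty)
        grads.append(row)
    return grads


def update_weights(ho_weights, grads, learn_rate):
    """Plain gradient-descent step: w <- w - learn_rate * dE/dw."""
    return [[w - learn_rate * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(ho_weights, grads)]


# Tiny example: 2 hidden nodes, 1 linear output node.
h_outputs = [0.5, -0.25]
ho_weights = [[2.0], [-4.0]]
target = 1.0
output = sum(h * w_row[0] for h, w_row in zip(h_outputs, ho_weights))
o_signals = [output - target]   # error signal for a linear output node

grads = ho_gradients(h_outputs, o_signals, ho_weights, lam=0.03)
new_weights = update_weights(ho_weights, grads, learn_rate=0.05)
print(grads)        # approximately [[0.56], [-0.37]]
print(new_weights)  # approximately [[1.972], [-3.9815]]
```

Bias values are left out of the sketch on purpose, in line with the convention of not regularizing them.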
But there are some important additional details.
Because L2 regularization just drives weights toward zero on each training iteration (unless countered by the other part of the gradient), you can implement L2 regularization in a completely different way. Instead of modifying the calculation of the gradient, you can first shrink each weight by a small fraction, and then apply the learning rate times the normal gradient (not augmented by the weight penalty):
for-each weight
  weight = weight * lambda        # toward 0
  weight += learnRate * gradient  # regular
end-for
But notice that with this approach lambda is a multiplier, so its value might be something like 0.97, and each weight decreases by a little bit (3%) each iteration. Compare this to the straightforward approach, where lambda is something like 0.03. The point is that reasonable values for lambda can be very small or very large, and a reasonable value depends on how L2 is implemented. Unfortunately, most NN libraries don’t tell you which implementation they use, so guessing a good lambda for L2 can be annoyingly difficult.
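For plain gradient descent, the two implementations can be related exactly: adding lam * w to the gradient gives the same result as multiplying the weight by (1 - learn_rate * lam) and then taking the normal step. The sketch below checks this numerically. The function names and the sample weight and gradient are made up for illustration, and the gradient is assumed to be dE/dw, so the step subtracts. The equivalence is only claimed here for plain gradient descent; with momentum or adaptive learning rates the two approaches generally behave differently.

```python
def step_penalty_in_gradient(w, grad, learn_rate, lam):
    """Add lam * w to the gradient, then take a normal step."""
    return w - learn_rate * (grad + lam * w)


def step_decay_first(w, grad, learn_rate, decay):
    """Shrink the weight by a multiplicative factor, then take a normal step."""
    return decay * w - learn_rate * grad


learn_rate = 0.05
lam = 0.03
decay = 1.0 - learn_rate * lam   # multiplier that matches lam at this learning rate

w, grad = 3.8392, 0.4
a = step_penalty_in_gradient(w, grad, learn_rate, lam)
b = step_decay_first(w, grad, learn_rate, decay)
print(f"penalty in gradient: {a:.6f}")
print(f"decay first:         {b:.6f}")
print(f"matching decay factor: {decay:.4f}")
```

So the same amount of regularization can show up as 0.03 in one implementation and as a multiplier just under 1 in another, and the multiplier also depends on the learning rate.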
And there are other ways to implement L2 too. The moral of the story is that even relatively simple things like neural network L2 regularization can be tricky because there are so many different implementation possibilities, and in code libraries, the implementation details are almost never clearly explained.