There is plenty of information about the theory of L1 regularization for neural networks, but I couldn’t find any information at all about how to actually implement L1 regularization.
Regularization is a technique designed to counter neural network over-fitting. Over-fitting occurs when you train a neural network too well and it predicts almost perfectly on your training data, but predicts poorly on any data not used for training.
As it turns out, an over-fitted neural network usually has weight values that are large in magnitude (like 538.1234 or -1098.5921) rather than small (like 3.8392 or -2.0944). So regularization attempts to keep weights small.
The two most common forms of regularization are called L1 and L2. In L1 regularization, you modify the error function you use during training to include an additional term that adds a fraction (usually denoted by the lowercase Greek letter lambda) of the sum of the absolute values of the weights. So larger weight values lead to larger error, and therefore the training algorithm favors and generates small weight values.
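As a minimal sketch of the penalized error just described: the code below assumes NumPy and a squared-error data term scaled by 1/2, and the helper name l1_error is hypothetical rather than taken from any library.

```python
import numpy as np

def l1_error(targets, outputs, weights, lam):
    """Squared error plus an L1 weight penalty (illustrative helper)."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    data_term = 0.5 * np.sum((targets - outputs) ** 2)  # plain error
    penalty = lam * np.sum(np.abs(weights))             # L1 term
    return data_term + penalty

t = [1.0, 0.0]
o = [0.8, 0.3]
small = l1_error(t, o, [3.8392, -2.0944], lam=0.01)
large = l1_error(t, o, [538.1234, -1098.5921], lam=0.01)
# Same prediction error in both cases, but the large weights cost more.
print(small < large)
```

With identical predictions, the only difference between the two error values is the penalty, which is why training is nudged toward the smaller weights.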
In the equations below, the Error function starts with a plain mean squared error term between target (t) and computed output (o) values. Then a weight penalty is added. The regular weight update rule changes each weight by a small learning rate (like 0.05) times the error gradient, which is the Calculus derivative of the Error function.
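One common way to write this setup is shown here as a sketch, not a reproduction of the original figure. The 1/2 scaling on the squared error, the index names k and i, and the symbol eta for the learning rate are all assumptions.

```latex
E = \frac{1}{2} \sum_{k} (t_k - o_k)^2 + \lambda \sum_{i} |w_i|

\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i},
\qquad
w_i \leftarrow w_i + \Delta w_i
```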
The Calculus derivative of the mean squared error term is a bit difficult to derive because it uses the Chain Rule, but the derivation is well-known and doesn’t change. Because the derivative of a sum is the sum of the derivatives, to get the derivative of the augmented Error function, all you have to do is add the derivative of the absolute value function.
Alas, the absolute value function is shaped like a “V” and does not have a derivative at w = 0. But this really isn’t a problem. When w is positive, the derivative (slope) is +1.0 and when w is negative, the derivative (slope) is -1.0, and when w is zero, you don’t care what the slope is because your goal is to drive weight values towards zero. Therefore, if a weight is positive you add lambda to the gradient and if a weight is negative you subtract lambda. Too easy!
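A quick numeric check can make the slope argument concrete. This sketch assumes NumPy, and the function names l1_penalty and l1_penalty_grad are hypothetical. It compares the sign-based derivative against a central finite difference for weights that are not at zero.

```python
import numpy as np

def l1_penalty(w, lam):
    # lambda times the sum of absolute weight values
    return lam * np.sum(np.abs(w))

def l1_penalty_grad(w, lam):
    # np.sign gives +1, -1, or 0, so a weight at exactly zero gets no push
    return lam * np.sign(w)

w = np.array([3.8392, -2.0944, 0.5])
lam = 0.01
eps = 1e-6

# Central finite difference, one weight at a time
numeric = np.zeros_like(w)
for i in range(w.size):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (l1_penalty(w + step, lam) - l1_penalty(w - step, lam)) / (2 * eps)

analytic = l1_penalty_grad(w, lam)
print(np.allclose(numeric, analytic))
```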
So, a straightforward implementation of L1 regularization just adjusts the calculation of the weight gradients (it’s standard not to apply regularization to bias values, but that’s another topic) and then updates weights as usual. For example, the calculation of the gradients for hidden-to-output nodes looks something like:
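A minimal Python sketch of that calculation follows. It is not the original code: it assumes NumPy, an identity output activation, and an error of 0.5 * sum((t - o)^2), and the names ho_gradients_l1, h_out, and ho_weights are hypothetical.

```python
import numpy as np

def ho_gradients_l1(h_out, outputs, targets, ho_weights, lam):
    """Error gradients (dE/dw) for hidden-to-output weights, plus the L1 term.

    Assumes identity output activation and error 0.5 * sum((t - o)^2),
    so each output-node signal is simply (o - t).
    """
    o_signals = outputs - targets           # shape (n_out,)
    grads = np.outer(h_out, o_signals)      # shape (n_hid, n_out), no penalty yet
    grads += lam * np.sign(ho_weights)      # add lambda if w > 0, subtract if w < 0
    return grads

h_out = np.array([0.5, -0.2, 0.1])          # hidden node outputs
ho_weights = np.array([[0.4, -0.3],
                       [0.0,  0.7],
                       [-0.6, 0.2]])
outputs = h_out @ ho_weights                # computed outputs (biases omitted)
targets = np.array([1.0, 0.0])
learn_rate = 0.05
lam = 0.01

grads = ho_gradients_l1(h_out, outputs, targets, ho_weights, lam)
ho_weights -= learn_rate * grads            # update weights as usual
```

Biases are left out of the sketch entirely; if they were included, they would be updated from the plain gradient without the lambda term.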
BUT. There’s always a “but” with neural networks. Because L1 regularization just drives weights towards zero by a constant amount each iteration, you can implement L1 regularization in a completely different way. Instead of modifying the calculation of the gradient, you can just move each weight towards zero by a constant value first (subtract it from a positive weight, add it to a negative weight), and then apply the learning rate times the normal gradient:
for-each weight
  weight = weight - (sgn(weight) * lambda)  # L1: shrink towards zero
  weight = weight - (learnRate * gradient)  # regular (gradient = dE/dw)
end-for
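The loop above can be sketched in Python as follows. This assumes NumPy, assumes the gradient passed in is the derivative of the unpenalized error with respect to each weight (so the regular step subtracts it), and uses a hypothetical function name, update_with_l1_shrink.

```python
import numpy as np

def update_with_l1_shrink(weights, grads, learn_rate, lam):
    """Shrink each weight towards zero by lam, then apply the usual step.

    grads is assumed to be dE/dw for the plain (unpenalized) error.
    """
    shrunk = weights - lam * np.sign(weights)   # L1 step: constant pull to zero
    return shrunk - learn_rate * grads          # regular gradient step

weights = np.array([0.40, -0.30, 0.00])
grads = np.array([0.10, -0.20, 0.05])
new_w = update_with_l1_shrink(weights, grads, learn_rate=0.05, lam=0.001)
print(new_w)
```

Note that in this version the pull towards zero is lambda itself, not lambda times the learning rate, so the same lambda value does not behave identically in the two approaches.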
The moral of the story is that even simple things like L1 regularization can be tricky because there are so many different implementation possibilities, and in code libraries, the implementation details are almost never clearly explained.