L2 Regularization and Back-Propagation

L2 regularization is simple in principle but difficult to explain because it depends on many interrelated ideas. Briefly, L2 regularization (also called weight decay, as I’ll explain shortly) is a technique intended to reduce overfitting in neural networks and in similar machine learning models that are based on math equations.

To really understand the “why” of L2 regularization, you have to understand neural network weights and training, and such an explanation would take a couple of pages at least. Moving on: neural network overfitting is often characterized by weight values that are very large in magnitude. The main idea of L2 regularization is to reduce the magnitude of the weights in order to reduce overfitting.

Every math-based model requires training, which is the process of using data that has known inputs and known correct outputs to find the values of the weights and biases (special weights). During training, the optimization algorithm (for example, back-propagation or swarm optimization) needs a measure of error. L2 regularization adds a penalty term, a fraction of the sum of the squared weights, to the error. Therefore, larger weight values contribute to larger error, and so smaller weights are rewarded.
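To make the penalty concrete, here is a minimal sketch of an error function with an L2 term. The function names are hypothetical, and the use of mean squared error and the lambda / (2n) scaling are assumptions chosen for illustration, not the only possible convention.

```python
def mean_squared_error(predictions, targets):
    # Plain (unregularized) error: average squared difference.
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n


def l2_penalty(weights, lam, n):
    # Assumed convention: (lambda / (2n)) * sum of squared weights.
    # Only weights are passed in; biases are deliberately left out.
    return (lam / (2.0 * n)) * sum(w * w for w in weights)


def regularized_error(predictions, targets, weights, lam):
    n = len(targets)
    return mean_squared_error(predictions, targets) + l2_penalty(weights, lam, n)


# Same (perfect) predictions, but different weight magnitudes.
small = regularized_error([1.0, 2.0], [1.0, 2.0], [0.1, -0.2], lam=0.5)
large = regularized_error([1.0, 2.0], [1.0, 2.0], [3.0, -4.0], lam=0.5)
print(small, large)
```

Both calls have zero prediction error, so the entire difference between the two results comes from the weight magnitudes.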

Note that at this point, to fully grasp L2 regularization, you must also understand how training error is measured and how training optimization algorithms work, which, again, would take several pages of explanation.

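And now things get really messy. In back-propagation training, the basic weight update expressed in an equation is:

$$
w \leftarrow w - \eta \frac{\partial C_0}{\partial w}
$$

Here w is a weight, \eta (eta) is the learning rate, and C_0 is the error function without any regularization term.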

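After doing some math that involves taking the derivative of the error function (also called the cost function), the update when using L2 regularization becomes:

$$
w \leftarrow \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}
$$

Here \lambda (lambda) is the L2 regularization constant, n is the number of training items, and C_0 is the error without the L2 term.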

Note: And before I forget, when using L2 regularization, the update equation for the bias values doesn’t change — a small detail that can cause a lot of grief if you’re writing code and don’t pay attention.
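A sketch of where both the decay factor and the unchanged bias update come from, assuming the L2 term is written as lambda / (2n) times the sum of the squared weights (a common convention, assumed here). C_0 denotes the error without the L2 term, C the error with it, and b a bias:

$$
C = C_0 + \frac{\lambda}{2n} \sum_{w} w^2
$$

$$
\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w,
\qquad
\frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}
$$

$$
w \leftarrow w - \eta \frac{\partial C}{\partial w}
  = \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w},
\qquad
b \leftarrow b - \eta \frac{\partial C_0}{\partial b}
$$

Because the sum runs over the weights only, the extra term never shows up in the bias derivative.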

In words, when using back-propagation with L2 regularization, to adjust a weight value, first multiply the weight by a factor of 1 – (eta * lambda) / n, where eta is the learning rate, lambda is the L2 regularization constant, and n is the number of training items involved (n = 1 for “online” learning). Then subtract eta times the partial derivative (loosely referred to as the gradient) of the original, unregularized cost (error) function. Because the factor is slightly less than 1, the weight values tend to decrease, or “decay”, during training.
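The update described above can be sketched in a few lines of Python. The function names and the numeric values are hypothetical, picked only to make the arithmetic easy to follow.

```python
def plain_update(value, grad, eta):
    # Basic update with no regularization; also what a bias uses.
    return value - eta * grad


def l2_update(weight, grad, eta, lam, n):
    # Shrink the weight first, then take the usual gradient step.
    decay_factor = 1.0 - (eta * lam) / n
    return decay_factor * weight - eta * grad


w = 2.0      # current weight
grad = 0.5   # partial derivative of the unregularized error
eta = 0.1    # learning rate
lam = 0.2    # L2 regularization constant
n = 1        # "online" learning

w_plain = plain_update(w, grad, eta)    # 2.0 - 0.05, about 1.95
w_l2 = l2_update(w, grad, eta, lam, n)  # 0.98 * 2.0 - 0.05, about 1.91
print(w_plain, w_l2)
```

The L2 version ends up slightly closer to zero than the plain version. A bias would go through `plain_update` in both cases.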

And, sadly, the messy details continue. When implementing L2 regularization, instead of adjusting weights exactly according to the math equations, you can simplify the code: adjust the weight as normal without L2 regularization, and then subtract a fraction of the original weight value. This approach reduces weight values but completely changes the meaning of the lambda constant, because the fraction you subtract now stands in for the whole quantity (eta * lambda) / n rather than for lambda alone.
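A minimal sketch of the simplified approach, next to the equation-based form for comparison. The names `simplified_update`, `equation_update`, and `decay` are hypothetical; the point is only that the two agree when `decay` is set to (eta * lambda) / n.

```python
def simplified_update(weight, grad, eta, decay):
    original = weight
    weight = weight - eta * grad        # normal update, no regularization
    weight = weight - decay * original  # then subtract a fraction of the original weight
    return weight


def equation_update(weight, grad, eta, lam, n):
    # Update written directly from the math equation.
    return (1.0 - (eta * lam) / n) * weight - eta * grad


w, grad = 2.0, 0.5
eta, lam, n = 0.1, 0.2, 4

decay = (eta * lam) / n  # 0.005: the value that makes both forms agree
a = simplified_update(w, grad, eta, decay)
b = equation_update(w, grad, eta, lam, n)
print(a, b)
```

If the code instead treats `decay` as a tuning constant in its own right, a value such as 0.005 here corresponds to lambda = 0.2 only for this particular learning rate and batch size, which is why the constant no longer means the same thing.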

And, sigh, things get messier if you consider training algorithms such as particle swarm optimization that directly use the error term, rather than algorithms such as back-propagation that use error indirectly (to calculate the gradient).

Well, if you’re reading this blog post because you want to understand L2 regularization, all these complications are probably a bit depressing. But the basic idea is simple: L2 regularization reduces weight values, which in turn reduces model overfitting. The technique itself is very simple; the difficulty is that a full understanding of it requires a full understanding of virtually every neural network concept.

The good news is that completely understanding L2 regularization is possible — you just have to understand all the related concepts.