## L1 and L2 Regularization for Machine Learning

I wrote an article titled “L1 and L2 Regularization for Machine Learning” for the January 2015 issue of Microsoft MSDN Magazine. See https://msdn.microsoft.com/en-us/magazine/dn904675.aspx.

The hardest part of L1 and L2 regularization is understanding what they are, not writing the code that implements them. Briefly, many forms of machine learning are essentially math equations that can be used to make predictions. Two of the most prominent examples are neural network classification and logistic regression classification. The underlying math equations have numeric constants, like 3.45, that are called weights.
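To make the idea of "an equation with weights" concrete, here is a minimal sketch of a logistic regression prediction. The function name `predict`, the weight and bias values, and the input are all made up for illustration.

```python
import math

def predict(weights, bias, x):
    """Logistic regression output: a value strictly between 0 and 1."""
    # Weighted sum of the inputs plus a bias term.
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    # The logistic sigmoid squashes z into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias; in practice these come from training.
weights = [3.45, -1.20]
bias = 0.50

p = predict(weights, bias, [1.0, 2.0])
label = 1 if p >= 0.5 else 0  # threshold the output to get a class
print(p, label)
```

The output can be read as a probability-like score, and thresholding it at 0.5 gives a predicted class.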

Training a classifier is the process of finding the values of the weights. This is done by using a set of training data that has known input values and output values. Training tries different values for the weights so that, for the training data, the computed outputs closely match the known correct outputs.
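The text above doesn't commit to a particular training algorithm, so the following is only one possibility: a sketch that assumes logistic regression trained with simple stochastic gradient descent. The tiny data set, the learning rate `lr`, and the epoch count are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def compute_output(weights, bias, x):
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train(data, lr=0.1, epochs=1000):
    """Adjust weights so computed outputs move toward the known outputs."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            err = compute_output(weights, bias, x) - y  # computed minus known
            # Nudge each weight against the error (gradient descent step).
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Hypothetical training data: (input values, known correct output).
data = [([0.0], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
weights, bias = train(data)

for x, y in data:
    print(x, y, round(compute_output(weights, bias, x), 3))
```

Note that nothing in this loop limits how large the weights can get, which is where regularization comes in.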

Unfortunately, if you train long enough, it’s almost always possible to find a set of weight values for which the computed outputs match the training outputs almost perfectly. But when you use those weights to make predictions on new, previously unseen data, the predictions are often very poor. This is called over-fitting: the weights fit the training data too well.

One characteristic of over-fitted weight values is that they tend to be large in magnitude. L1 and L2 regularization limit the magnitudes of the weights by adding a penalty term to the error that training tries to minimize. L1 regularization penalizes the sum of the absolute values of the weights. L2 regularization penalizes the sum of the squared values of the weights.
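The two penalties are easy to compute. The sketch below assumes the penalty is scaled by a strength constant, here called `lam` (a hypothetical name and value), and added to whatever base error the training uses. Some formulations scale the L2 term by `lam / 2` instead; that detail is left out here.

```python
def l1_penalty(weights):
    """Sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of the squared values of the weights."""
    return sum(w * w for w in weights)

def regularized_error(base_error, weights, lam, kind):
    """Base error plus a weighted penalty; kind is 'L1' or 'L2'."""
    penalty = l1_penalty(weights) if kind == "L1" else l2_penalty(weights)
    return base_error + lam * penalty

weights = [3.45, -2.0, 0.5]  # illustrative values only
lam = 0.01                   # hypothetical regularization strength

print(l1_penalty(weights))   # 3.45 + 2.0 + 0.5
print(l2_penalty(weights))   # 3.45^2 + 2.0^2 + 0.5^2
print(regularized_error(0.25, weights, lam, "L2"))
```

Because the L2 penalty squares each weight, one large weight costs far more under L2 than under L1.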

L1 regularization sometimes has the nice side effect of pruning unneeded features by setting their associated weights to 0.0. However, L1 regularization doesn’t easily work with all forms of training, in part because the absolute value function isn’t differentiable at zero, which complicates gradient-based techniques. L2 regularization works with all forms of training, but doesn’t give you implicit feature selection. In practice, you must use trial and error to determine which form of regularization (or neither) is better for a particular problem.
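Here is a rough way to see why L1 can zero out weights while L2 does not. The sketch applies only the penalty part of an update, ignoring the training data entirely. For L1 it uses soft-thresholding, which is one common way to handle the absolute value and is my choice for the demo rather than anything prescribed above. The values of `lr` and `lam` are arbitrary.

```python
def l1_step(w, lr, lam):
    """Soft-thresholding: move w toward 0 by lr * lam, stopping at exactly 0."""
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0

def l2_step(w, lr, lam):
    """Gradient step on lam * w^2: scales w by a constant factor."""
    return w * (1.0 - 2.0 * lr * lam)

lr, lam = 0.1, 0.5            # hypothetical learning rate and strength
start = [3.45, -0.8, 0.02]    # illustrative weights
l1_w = list(start)
l2_w = list(start)

for _ in range(10):
    l1_w = [l1_step(w, lr, lam) for w in l1_w]
    l2_w = [l2_step(w, lr, lam) for w in l2_w]

print("L1:", l1_w)  # the small weight lands on exactly 0.0
print("L2:", l2_w)  # every weight shrinks but stays nonzero
```

The L1 step subtracts a fixed amount, so small weights hit zero and stay there. The L2 step shrinks each weight in proportion to its size, so weights get small but don't reach exactly zero.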