Why You Should Not Use Neural Network Label Smoothing

Neural network label smoothing is a technique to prevent model overfitting. I never use label smoothing (LS) because:

1. LS introduces a new hyperparameter, which makes an already complex system more complex and makes the results less interpretable.
2. LS modifies data, which is conceptually offensive and problematic in practice.
3. You can achieve a roughly equivalent LS effect by using weight decay or L1/L2 regularization.

I’ll explain label smoothing by using an example. Suppose you create a neural network classifier where there are three possible outcomes, for example, the Iris dataset, where the three species to predict are setosa, versicolor, and virginica. Your training data might look like:

5.1, 3.5, 1.4, 0.2,  1, 0, 0  # setosa
7.0, 3.2, 4.7, 1.4,  0, 1, 0  # versicolor
6.3, 2.9, 5.6, 1.8,  0, 0, 1  # virginica
. . .

The first four values on each line are predictors and the next three values are the one-hot encoded species. An example of label smoothing is to modify the training data to use “soft targets” like so:

5.1, 3.5, 1.4, 0.2,  0.8, 0.1, 0.1  # setosa
7.0, 3.2, 4.7, 1.4,  0.1, 0.8, 0.1  # versicolor
6.3, 2.9, 5.6, 1.8,  0.1, 0.1, 0.8  # virginica
. . .

This label smoothing approach sometimes reduces model overfitting, so that when the trained model is presented with new, previously unseen data, the prediction accuracy is better than it would be without label smoothing.

Here’s a brief, hand-waving argument about what happens when you use LS training data. First, without LS, imagine you are updating the middle output node, the target value is 1, and the computed output value is 0.75. You want to increase the weights that are connected to the node so that the computed output will increase and get closer to the target of 1.

Regardless of whether you are using cross entropy error or mean squared error, a weight delta is computed using the calculus derivative of the error function, and that delta always contains the error term (target – output), which is (1 – 0.75) = 0.25. That error will be modified by the learning rate, so if the learning rate is 0.01 the delta will contain 0.25 * 0.01 = 0.0025 and the weight will increase slightly.

Now on the next training iteration, suppose the computed output is 0.97. The error term is (1 – 0.97) = 0.03 and the delta will contain 0.03 * 0.01 = 0.0003 and the weight will increase but only by a tiny amount.

The ultimate effect of this training approach is that weight values could get very large, and large weight values sometimes give an overfitted model.

Now, suppose you’re using label smoothing. If the computed output is 0.75, the error term is (target – output) = (0.8 – 0.75) = 0.05 and the weight delta will contain 0.05 * 0.01 = 0.0005 and the weight will increase, but only by a small amount. Now on the next iteration, if the computed output is 0.97 the error term is (0.8 – 0.97) = -0.17 and the delta will contain -0.17 * 0.01 = -0.0017 and the weight value will decrease slightly.

The ultimate effect of the label smoothing approach is that weight values are usually prevented from getting very large, which can help prevent model overfitting.
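
To make the arithmetic concrete, here’s a minimal sketch, in Python, of just the (target - output) * learning-rate part of a weight delta, for a hard target of 1 versus a smoothed target of 0.8. All of the other terms in a real back-propagation update are deliberately left out; the numbers are the ones from the example above.

lr = 0.01  # learning rate
for target in (1.0, 0.8):        # hard target vs. label-smoothed target
    for output in (0.75, 0.97):  # computed output on two training iterations
        print(target, output, (target - output) * lr)

# hard target 1.0:     0.75 -> +0.0025   0.97 -> +0.0003  (weight always nudged up)
# smoothed target 0.8: 0.75 -> +0.0005   0.97 -> -0.0017  (weight can be nudged down)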

Let me emphasize that this hand-waving argument has left out many important details.

OK. First problem with label smoothing: Where did the (0.1, 0.8, 0.1) soft targets come from? Why not (0.15, 0.70, 0.15) or (0.2, 0.6, 0.2) or something else? There’s no good answer to this question. Mathematically, label smoothing is usually presented as:

t’ = (1-a) * t + (a/K)

where t’ is the soft target, t is the original hard target (0 or 1), K is the number of classes, and a is any value between 0.0 and 1.0. For example, if a = 0.10 and K = 3, then a hard target of 1 becomes (1 – 0.10) * 1 + (0.10 / 3) = 0.9333 and the two 0 hard targets become 0.0333 each.
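
If you do use label smoothing, applying this equation to a one-hot target vector takes only a couple of lines of code. Here is a minimal sketch in Python; the function name smooth_labels and the use of NumPy are my choices for illustration, not part of any standard API.

import numpy as np

def smooth_labels(one_hot, a):
  # t' = (1 - a) * t + (a / K), applied element-wise
  k = len(one_hot)
  return (1.0 - a) * np.array(one_hot, dtype=np.float64) + (a / k)

print(smooth_labels([0, 1, 0], 0.10))  # [0.0333... 0.9333... 0.0333...]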

But this apparently sophisticated math basis is a hoax because there’s no good way to choose the value of a. In other words, the label smoothing values can be whatever you want. Ugly.

The second problem with label smoothing is that its main effect is to restrict the magnitude of weight values, and there are other, simpler techniques that do this, such as weight decay, L1 regularization, and L2 regularization. Now, it’s true that these techniques don’t work exactly the same way as LS, but the general principle is the same.
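
For comparison, here is what a generic weight update with L2 regularization (weight decay) looks like. This is the textbook form written as a minimal Python sketch with made-up weight and gradient values, not code from any particular library.

weights = [0.50, -1.20, 0.85]  # example weight values
grads = [0.10, 0.05, -0.20]    # made-up gradients from back-propagation
lr = 0.01                      # learning rate
lam = 0.001                    # L2 regularization constant
for i in range(len(weights)):
  # ordinary gradient step, plus a small pull of each weight toward zero
  weights[i] -= lr * (grads[i] + lam * weights[i])
print(weights)

The lam * weights[i] term is what keeps weight magnitudes from growing without bound, which is the same general effect that label smoothing has.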

Finally, the worst problem with label smoothing in my opinion is that you are changing data. Philosophically this is just ugly, ugly, ugly. It’s true that you don’t have to physically change the training data — instead you can programmatically change the hard target values to label smoothed soft target values during training. But modifying data is almost always just wrong.

Let me wrap up by saying that when I did my research on label smoothing for this blog post, I was horrified by what I found on the Internet. Almost every blog post and short article, and even many formal research papers, had significant errors.

For example, almost all references either imply or explicitly state that there’s a necessary relation between label smoothing and cross entropy error. This is not correct. You can use label smoothing with cross entropy error or mean squared error or any other kind of error. When you use some form of error, the back-propagation technique uses the calculus derivative of the error function, not the error function itself, to compute a weight update delta value. The weight update term for all error functions contains a (target – output) term, and that term is the only place where label smoothing comes into play. For details, see my post at https://jamesmccaffrey.wordpress.com/2019/09/23/neural-network-back-propagation-weight-update-equation-mean-squared-error-vs-cross-entropy-error/.
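
To see exactly where label smoothing enters the computation, here is a minimal sketch using softmax output with cross entropy error, where the gradient with respect to the pre-softmax values reduces to (output - target). The numeric values are made up; the point is that a smoothed target just replaces the hard target inside that one term, and nothing else changes.

import numpy as np

z = np.array([1.5, 0.3, -0.8])          # made-up pre-softmax output values
output = np.exp(z) / np.sum(np.exp(z))  # softmax

hard = np.array([1.0, 0.0, 0.0])           # one-hot target
soft = np.array([0.9333, 0.0333, 0.0333])  # label-smoothed target, a = 0.10, K = 3

print(output - hard)  # gradient with respect to z using the hard target
print(output - soft)  # identical form; only the target values differ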

I also read several Internet label smoothing articles that talked about “confidence” and “calibration” that were complete technical nonsense.

Incidentally, label smoothing has been around since at least the mid-1980s, when it wasn’t uncommon to use 0.9 and 0.1 instead of 1 and 0 for binary classification. This is exactly equivalent to label smoothing with K = 2 and a = 0.2. It seems the technique was forgotten in the late 1990s but then was “rediscovered” in the mid-2010s.
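
As a quick check, plugging a = 0.2 and K = 2 into the equation given earlier: a hard target of 1 becomes (1 - 0.2) * 1 + (0.2 / 2) = 0.9, and a hard target of 0 becomes (1 - 0.2) * 0 + (0.2 / 2) = 0.1.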

Thank you to my colleague Hyrum A. who pointed out a recent research paper that looked at label smoothing.


“Smooth douglasia” – a relatively rare wildflower that grows in the Pacific Northwest. “Smooth Operator” – a 1984 song by a British group called Sade. “Antelope Smooth Red Rock Canyon” – a beautiful slot canyon in Arizona. “Smooth haired dachshund” – originally bred in the early 1700s to hunt burrow-dwelling animals like badgers and rabbits. This dachshund puppy doesn’t look very threatening to burrow-dwelling animals or anything else.


1 Response to Why You Should Not Use Neural Network Label Smoothing

  1. Thorsten Kleppe says:

    What if we take the pattern of the last hidden layer as a form of label smoothing?

    A 784-5-10 neural network can predict the MNIST data pretty well. Different input digits produce different hidden activation patterns. For simplicity, the ReLU activation is shown as 0 or 1 in this example.

    Hidden neuron activation levels after training, for input digits 0 – 9
    h = 1 2 3 4 5
    (input digit = hidden pattern = pattern length)

    0 = 1 0 1 1 1 = 4
    1 = 0 1 0 1 0 = 2
    2 = 0 1 1 1 1 = 4
    3 = 1 1 1 1 1 = 5
    4 = 1 0 0 0 0 = 1
    5 = 1 0 0 1 0 = 2
    6 = 1 0 1 1 0 = 3
    7 = 0 0 0 0 1 = 1
    8 = 1 0 0 1 1 = 3
    9 = 1 0 0 0 1 = 2
    ——————————
    7 3 4 7 6 <- how many digits activate each hidden neuron

    A prediction with only one of the trained hidden neurons:
    h1   h2  h3  h4  h5
    (4) (0+1) (6) (8) (7)
    Hidden neuron 2 alone can predict two classes, 0 and 1, which makes it the winner neuron.

    When one of the 5 hidden neurons is dropped:
    h1 takes out prediction for 0, 3, 4, 5, 9 = 5
    h2 takes out prediction for 1, 3 = 2
    h3 takes out prediction for 0, 2, 3, 5, 6 = 5
    h4 takes out prediction for 0, 3, 4, 5, 8 = 5
    h5 takes out prediction for 0, 7, 9 = 3

    It was really cool to see that, and how everything changes.
    The comparison of similar digits makes intuitive sense to me.
    Patterns 8 = 1 0 0 1 1 and 9 = 1 0 0 0 1 differ most in hidden neuron 4.
    The 0 pattern is closer to the pattern for 8 than to the pattern for 9, and p2 is very close to p3, and so on.

    Instead of using a hard target value of 1 for the target digit, or the label smoothing technique, we could take the activation pattern as a smoother and more differentiable label for the hidden neurons in front of the output neurons.
    A more realistic pattern for 8 could look like (2.3, 0.0, 0.0, 4.6, 0.9), and this could act as a smooth label.
    The new label could be created from the neural network itself and is closer to what the neural network really wants.

    I hope this post makes sense in a way.
