Why I Don’t Use Min-Max or Z-Score Normalization For Neural Networks

Normalization is the process of scaling numeric predictor values so that they’re all roughly in the same range, typically 0.0 to 1.0 (min-max normalization) or about -4.0 to +4.0 (z-score normalization). Over the past few years I have become quite convinced that neither of these techniques is as good as simple order-magnitude normalization. The ideas are subtle and are best explained by a concrete example.

Note: Encoding is the process of converting non-numeric data, such as a color variable that can be red, blue, or green, into numeric values such as (1, 0, 0), (0, 1, 0), (0, 0, 1). Encoding is a separate topic.

Suppose you have just 10 data items where the predictors are a person’s age and some sort of money account balance:

age    balance   val-to-predict
24      450.00        y
36     2750.00        y
57      367.00        y
28    -1730.00        y
44      310.00        y
39      803.00        y
50      189.00        y
47     4120.00        y
29     -203.00        y
31      402.00        y

If you’re going to feed this data to a neural network system, you want to normalize the predictor data so that the large balance values don’t overwhelm the smaller age values. (There are a few ML techniques that don’t need normalization, but most do.)

If you want to use min-max normalization, you must first do a random train-test split, then find the min and max of each predictor using only the train data, and then use those two values to compute x’ = (x - min) / (max - min) on both the train data and the test data. You should not use any knowledge about the test data for normalization!
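A minimal sketch of that procedure in plain Python, using the ten rows from the table above. The helper names (split, fit_min_max, apply_min_max), the 8-2 split size, and the seed value are my own assumptions for illustration, not part of any library.

```python
import random

# The ten (age, balance) rows from the table above.
rows = [
    (24, 450.00), (36, 2750.00), (57, 367.00), (28, -1730.00), (44, 310.00),
    (39, 803.00), (50, 189.00), (47, 4120.00), (29, -203.00), (31, 402.00),
]

def split(data, n_train, seed):
    """Shuffle a copy of the data and cut it into train and test parts."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def fit_min_max(train):
    """Return one (min, max) pair per column, computed from train rows only."""
    columns = list(zip(*train))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(data, stats):
    """Apply x' = (x - min) / (max - min) column by column."""
    return [
        tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(row, stats))
        for row in data
    ]

# Hypothetical split: 8 train rows, 2 test rows, arbitrary seed.
train, test = split(rows, n_train=8, seed=0)
stats = fit_min_max(train)            # must be saved for new data later
train_norm = apply_min_max(train, stats)
test_norm = apply_min_max(test, stats)  # reuses the train min and max
```

Changing the seed changes which rows land in the train set, and therefore can change the saved min and max values. Test values can also fall outside 0.0 to 1.0 when a test row is smaller or larger than anything in the train data.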

There are several significant problems here. First, the min and max values will depend on how the data was split, and so no two people will normalize a dataset in the same way. Second, you must record and save the min and max values so you can use them to normalize new, previously unseen data on which you wish to make a prediction. And the min and max values are likely to be very ugly and hard-to-remember, like min = 42.3333, max = 57.1825. Third, you lose the plus-minus sign information, which is sometimes very important. Fourth, the normalized scores are difficult to interpret — what is the actual age of a person whose normalized age is 0.35?

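If you use z-score normalization, x’ = (x - mean) / sd, you have the same problems as min-max normalization: the mean and standard deviation depend on how the data was split, they must be saved for later use, the sign of a normalized value is relative to the mean rather than the sign of the source value, and the normalized values are difficult to interpret.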
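To make the sign issue concrete for z-score normalization, here is a small sketch that treats the first eight balance values from the table as a hypothetical train set. Using the population standard deviation (statistics.pstdev) is an assumption on my part; some people use the sample standard deviation instead.

```python
import statistics

# Hypothetical train set: the first eight balance values from the table.
train_balances = [450.00, 2750.00, 367.00, -1730.00,
                  310.00, 803.00, 189.00, 4120.00]

mean = statistics.mean(train_balances)   # 907.375 for these eight values
sd = statistics.pstdev(train_balances)   # population standard deviation

def z_score(x, mean, sd):
    """Apply x' = (x - mean) / sd."""
    return (x - mean) / sd

z_values = [z_score(x, mean, sd) for x in train_balances]

# Print raw and normalized values side by side.
for raw, z in zip(train_balances, z_values):
    print(f"{raw:9.2f} -> {z:+.4f}")
```

A positive balance such as 450.00 gets a negative z-score here because it is below the train mean, so the sign of the normalized value no longer tells you whether the account is overdrawn.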

The normalization technique I prefer doesn’t have a standard name, but I call it order-magnitude normalization. It’s super simple: you divide each predictor column by a power of 10 (10, 100, 1000, and so on) chosen so that the magnitude of every value in the column is between 0.0 and 1.0. For the age values above, you’d divide each by 100. For the balance values, you’d divide each by 10,000.

age    balance   val-to-predict
0.24    0.0450        y
0.36    0.2750        y
0.57    0.0367        y
0.28   -0.1730        y
0.44    0.0310        y
0.39    0.0803        y
0.50    0.0189        y
0.47    0.4120        y
0.29   -0.0203        y
0.31    0.0402        y
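The table above can be reproduced with a few lines of Python. This is a minimal sketch: the divisors (100 for age, 10,000 for balance) come from the text, while suggest_divisor is a hypothetical convenience helper of my own that picks the smallest power of 10 that is at least as large as the biggest magnitude in a column. In practice you can just as well pick the divisors by eye.

```python
# The ten (age, balance) rows from the first table.
rows = [
    (24, 450.00), (36, 2750.00), (57, 367.00), (28, -1730.00), (44, 310.00),
    (39, 803.00), (50, 189.00), (47, 4120.00), (29, -203.00), (31, 402.00),
]

# Divisors from the text: age / 100, balance / 10,000.
divisors = (100.0, 10_000.0)

def normalize(row, divisors):
    """Divide each predictor by its column divisor; the sign is preserved."""
    return tuple(x / d for x, d in zip(row, divisors))

def suggest_divisor(column):
    """Hypothetical helper: smallest power of 10 >= the largest magnitude."""
    max_abs = max(abs(x) for x in column)
    d = 1.0
    while d < max_abs:
        d *= 10.0
    return d

normalized = [normalize(r, divisors) for r in rows]

# Print the normalized rows in the same layout as the table.
for age, balance in normalized:
    print(f"{age:.2f}   {balance:7.4f}")

# The helper agrees with the hand-picked divisors for this data.
suggested = tuple(suggest_divisor(col) for col in zip(*rows))
```

No min, max, mean, or standard deviation is computed from a train split, so the same divisors apply no matter how the data is later divided.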

First, notice that this technique does not depend on how the data is split into train-test, or train-validate-test. Therefore you can normalize the source data and then split — a huge savings in effort, trust me. Second, you only have to record a single vector of divisor values, such as (10, 1, 100, 1, 10) which is much simpler to use when normalizing new data items. Third, the order-magnitude normalization technique maintains the plus-minus sign information of the source data. Fourth, normalized values are easy to interpret — a normalized age of 0.35 is 35 years old.
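As a sketch of the second and fourth points, the saved divisor vector is all you need to normalize a new item and to map a normalized value back to its source value. The new_item values below are made up for illustration, and denormalize is a hypothetical helper name.

```python
# The only thing that has to be saved: one divisor per predictor column.
divisors = (100.0, 10_000.0)   # age, balance

def normalize(row, divisors):
    """Divide each predictor by its column divisor."""
    return tuple(x / d for x, d in zip(row, divisors))

def denormalize(row, divisors):
    """Multiply back by the divisors to recover the source values."""
    return tuple(x * d for x, d in zip(row, divisors))

# Hypothetical new, previously unseen item: age 35, balance -1250.00.
new_item = (35, -1250.00)

norm_item = normalize(new_item, divisors)    # roughly (0.35, -0.125)
restored = denormalize(norm_item, divisors)  # roughly (35, -1250)

print(norm_item)
print(restored)
```

The negative balance stays negative after normalization, and reading the normalized age is just a matter of shifting the decimal point.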

The order-magnitude normalization technique is almost never used by anyone except me, or employees at my workplace who have taken a technical training class from me. In fact, I have never seen the technique used in any research or system (although I’m sure it must be used sometimes).

Does this mean I’m wrong about the advantages of order-magnitude normalization compared to min-max or z-score normalization? I don’t think so. Unfortunately, in machine learning, topics are so complicated that everyone who learns neural-based ML accepts traditional lore as unimpeachable gospel truth. This is understandable because anyone who is starting to learn ML is swamped by complex information.

But in neural ML, sometimes a “fact that everyone knows is true” — such as “you should normalize your numeric data using min-max or z-score” — is just not true at all.

Common knowledge that everyone knows is true: Dogs 1.) are inherently able to catch frisbees, 2.) are always great at obstacle courses, 3.) are natural enemies of cats.


2 Responses to Why I Don’t Use Min-Max or Z-Score Normalization For Neural Networks

Thorsten Kleppe says:

August 16, 2021 at 8:31 am

Your recommendation on normalization has always helped me, especially with real data.

With the softmax function, we actually do the opposite of what you recommend here in the post. We take some values and squeeze them into a min-max range between zero and one. This looks great, but we are creating problems that in most cases we don’t even know exist because we have hidden them. In this way, scale invariance is lost.

The history of activation functions shows how badly the ML crowd has failed for decades. Knowledge had to form laboriously over a long time and fight its way from Sigmoid -> Tanh -> ReLU. An activation function that is better than ReLU, if one ever exists, will still have the properties of ReLU, which is incredibly simple and yet avoids the purely linear activation function that spoils all the fun.

The loss function is another example: is it really useful? On a prediction heatmap, you create very hard transitions when it is combined with softmax. When training, if you avoid examples that are above the average probability of softmax, the heatmap becomes very soft, even though the threshold between classes can be the same. But it gives you the opportunity to add more data points on the imaginary graph without having to change the stone construction of the numbers too much. The neural network makes its predictions in exactly the same way and the shape remains malleable, which is also called regularization.

One thing that often bothers me is unexpected implementations. What do you do when you implement a technique and ask yourself, “Is this right?”
And how do you react when the “right” implementation performs much worse in testing than a “technically wrong” implementation you encountered along the way? Assuming you even get to the point where you can say that with certainty.

Today we are working with Python, but is it any good? No, of course not! We use it for the tasks that require the most computing power, but there are few programming languages that would be even more unsuitable for those tasks.

The argument for leaving the computation to the GPU comes from NVIDIA. Nevertheless, the path CPU -> GPU -> CPU is always the same. And to make it worthwhile, most images have to have a very high resolution. However, most small tasks can be solved much better with the CPU. If you believe the internet, you could assume that it has to be programmed with C++. But if you compare the time to read the data, that is always the biggest bottleneck.

For me, C# actually ran faster than C++ because of the read-in time, but if you do the test and run C# in debug mode instead of release mode, you’re more likely to complain why C# is so slow. There are also spacy algorithms for reading data that would probably make C++ faster. But let’s face it, in general people are unlikely to use highly specialized algorithms as most of them are completely unknown and can only be understood by an absolute expert.

On YouTube there is a very interesting video:
“How Do Neural Networks Grow Smarter? – with Robin Hiesinger”
After that, you might get the idea that we haven’t even really understood the perceptron.

I was very impressed by the post about poor explanations of graphs with x and y coordinates. It is not possible to represent anything with only 2 values without hidden neurons, except a line separating two classes. In this, the representation is not as wrong as the explanation if you use feature engineering, but I didn’t find anything about that in my search.

The banknote dataset is perhaps my favorite example of the confusions that already have something of the “Asch paradigm”. Remember, in binary problems you can just get the solution 100% wrong and then flip the sign, right? But it is a little scary.

It’s crazy and would have to be listed once to make it into the top ten, but that’s actually your job. If you ask me, people don’t read your blog enough. I like the term research, it gives us a mission to search again. That’s what I found in your blog, James.