I wrote an article titled “Data Prep for Machine Learning: Normalization” in the August 2020 edition of the online Microsoft Visual Studio magazine. See https://visualstudiomagazine.1105cms01.com/articles/2020/08/04/ml-data-prep-normalization.aspx.
The article is one of a series that explains how to programmatically prepare data for use in a PyTorch neural network. Suppose you want to predict a person’s political leaning (conservative, moderate, liberal) from predictors such as sex, age, income, region of residence, and so on. The idea of normalization is to scale the numeric predictor values so that they all fall in roughly the same range. That way, large values (such as a person’s annual income) don’t overwhelm smaller values (such as a person’s age).
In situations where the source data file is small, about 500 lines or fewer, you can usually normalize numeric data manually using a text editor or spreadsheet. But in almost all realistic scenarios with large datasets, you must normalize your data programmatically.
There are several different types of data normalization. The three most common types are min-max normalization, z-score normalization, and constant factor normalization. In my article, I present a complete end-to-end demo program that uses min-max normalization. And I explain how the demo program can be easily modified to use z-score or constant factor normalization.
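To make the three techniques concrete, here is a minimal sketch in plain Python. It is not the demo program from the article. The function names, the sample ages, and the choice of 100 as the constant factor are assumptions made up for illustration, and the z-score version assumes the population standard deviation (a sample standard deviation is an equally reasonable choice).

```python
import statistics

def min_max_normalize(values):
    # Scale so the smallest value maps to 0.0 and the largest to 1.0.
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score_normalize(values):
    # Subtract the mean and divide by the standard deviation.
    # Assumption: population standard deviation (pstdev).
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

def constant_factor_normalize(values, factor):
    # Divide every value by the same fixed constant.
    return [v / factor for v in values]

ages = [20, 30, 40, 50, 60]  # hypothetical sample data

print(min_max_normalize(ages))
print(z_score_normalize(ages))
print(constant_factor_normalize(ages, 100))
```

Min-max results always land between 0.0 and 1.0, z-score results are typically between about -3 and +3, and constant factor results depend entirely on the constant you pick.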
In theory, it’s not necessary to normalize numeric data for training a neural network. The idea is that the network weights and biases will adapt during training to handle differently scaled predictor values. But in practice, data normalization is usually necessary to get a good prediction model.
For numeric data clustering algorithms, such as k-means variants, normalization is usually essential. These clustering algorithms are based on a distance metric. If data is not normalized, variables with large magnitudes (such as annual income) will dominate variables with smaller magnitudes (such as age). Without normalization, the clustering will effectively be based on just the variable that has the largest-magnitude values.
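The dominance effect is easy to see with a small sketch. The three (age, income) records below are made up for illustration, and the helper name min_max_columns is hypothetical. The code assumes Euclidean distance, which is the usual choice for k-means, and Python 3.8 or later for math.dist.

```python
import math

def min_max_columns(rows):
    # Min-max normalize each column of a list of equal-length tuples.
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [tuple(r) for r in zip(*scaled_cols)]

# Hypothetical (age, annual income) records.
people = [(25, 50_000), (65, 50_500), (26, 58_000)]
a, b, c = people

# Raw distances: a and b differ by 40 years of age, yet the
# distance is almost exactly the 500-dollar income difference.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)
print("raw:       ", raw_ab, raw_ac)

# After normalization, age and income both contribute.
na, nb, nc = min_max_columns(people)
norm_ab = math.dist(na, nb)
norm_ac = math.dist(na, nc)
print("normalized:", norm_ab, norm_ac)
```

With the raw data, the first person looks far closer to the second than to the third, purely because of income. After min-max normalization the two distances are nearly equal, because the large age gap now counts as much as the large income gap.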
Uniforms are a form of people normalization. The police forces of some countries have very handsome uniforms. Left: Royal Canadian Mounted Police. Center: Russian police. Right: Italian Carabinieri police.