Data Normalization and Encoding for Neural Networks in a Nutshell

I gave a workshop on machine learning recently, and there were a lot of questions about data normalization and encoding. A complete explanation would take approximately 20 pages, but here is a super-short summary of what to do.

Suppose you have this raw data:

Age Income      Education  Sex  Political
24  $24,000.00  Low        F    Democrat
62  $82,000.00  Medium     M    Republican
38  $64,000.00  High       M    Other
30  $40,000.00  Medium     F    Democrat
45  $42,000.00  High       F    Republican

Your goal is to predict Political affiliation from Age, Income, Education, and Sex.

1. Normalize the numeric predictor values so that they’re mostly in the same range (for example 0.0 to 1.0) using one of three techniques. This prevents large magnitudes like Income from overwhelming small magnitudes like Age.

a. Using order of magnitude normalization – divide each predictor value by a constant. For example, you could divide all Age values by 100 and all Income values by 100,000 giving:

Age    Income
0.24   0.2400
0.62   0.8200
0.38   0.6400
0.30   0.4000
0.45   0.4200
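
In code, this is just a division by a constant. A minimal Python sketch, using the example values above:

# order of magnitude normalization: divide each column by a power-of-ten constant
ages = [24, 62, 38, 30, 45]
incomes = [24000, 82000, 64000, 40000, 42000]
norm_ages = [x / 100 for x in ages]            # [0.24, 0.62, 0.38, 0.30, 0.45]
norm_incomes = [x / 100_000 for x in incomes]  # [0.24, 0.82, 0.64, 0.40, 0.42]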

b. Using min-max normalization – each value x becomes x’ = (x – min) / (max – min). For example, for Age, the min value is 24 and the max value is 62. So the Age value x = 38 becomes (38 – 24) / (62 – 24) = 0.37 and so on:

Age    Income
0.00   0.00
1.00   1.00
0.37   0.69
0.16   0.28
0.55   0.31
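
The same idea as a small Python function. A minimal sketch; note it assumes max > min (otherwise the division is undefined):

def min_max_normalize(values):
    # x' = (x - min) / (max - min), so results fall in [0.0, 1.0]
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

norm_ages = min_max_normalize([24, 62, 38, 30, 45])
# [0.00, 1.00, 0.37, 0.16, 0.55] after rounding to two decimals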

c. Using z-score normalization – each value x becomes x’ = (x – mean) / sd. For example, the mean of the Age values is 39.80 and the sd (the population standard deviation) is 13.18. So the Age value x = 38 becomes (38 – 39.80) / 13.18 = -0.14 and so on:

Age    Income
-1.20   -1.30
 1.68    1.56
-0.14    0.67
-0.74   -0.51
 0.39   -0.41
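
A corresponding Python sketch. Note that I use the population standard deviation (divide by n); with the sample standard deviation (divide by n-1) the results would differ slightly:

import math

def z_score_normalize(values):
    # x' = (x - mean) / sd, using the population sd (divide by n)
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [(x - mean) / sd for x in values]

norm_ages = z_score_normalize([24, 62, 38, 30, 45])
# [-1.20, 1.68, -0.14, -0.74, 0.39] after rounding to two decimals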

Of these three techniques I prefer order of magnitude normalization. It’s the easiest, and in my experience it works better than min-max or z-score normalization.

2. Convert non-numeric predictor values by using one-hot encoding (for three or more possible values) or minus-one plus-one encoding (for Boolean / binary / two possible values).

For example, the Education predictor variable can be Low, Medium, or High. Let Low = (1, 0, 0), Medium = (0, 1, 0), High = (0, 0, 1) giving:

Education
1  0  0
0  1  0
0  0  1
0  1  0
0  0  1

For the Sex predictor, let M = -1 and F = +1 giving:

Sex
 1
-1
-1
 1
 1
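
In code, both predictor encodings are just lookup tables. A minimal Python sketch, with the category orderings taken from the examples above:

# one-hot encoding for a predictor with three or more possible values
edu_map = {"Low": [1, 0, 0], "Medium": [0, 1, 0], "High": [0, 0, 1]}
# minus-one plus-one encoding for a binary predictor
sex_map = {"M": -1, "F": +1}

educations = ["Low", "Medium", "High", "Medium", "High"]
sexes = ["F", "M", "M", "F", "F"]

encoded_edu = [edu_map[e] for e in educations]  # [[1,0,0], [0,1,0], ...]
encoded_sex = [sex_map[s] for s in sexes]       # [1, -1, -1, 1, 1]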

3. For a classification problem, encode the values to predict using one-hot encoding (for three or more possible values) or 0-1 encoding (for two possible values). To predict Political affiliation let Democrat = (1, 0, 0), Republican = (0, 1, 0), Other = (0, 0, 1) giving:

Political
1  0  0
0  1  0
0  0  1
1  0  0
0  1  0

If you were trying to predict Sex instead of Political affiliation, let M = 0 and F = 1 giving:

Sex
 1
 0
 0
 1
 1
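
The dependent-variable encodings look almost identical in code. A minimal sketch:

# one-hot encoding for a target with three or more possible values
pol_map = {"Democrat": [1, 0, 0], "Republican": [0, 1, 0], "Other": [0, 0, 1]}
# 0-1 encoding for a binary target
sex_target_map = {"M": 0, "F": 1}

politics = ["Democrat", "Republican", "Other", "Democrat", "Republican"]
encoded_pol = [pol_map[p] for p in politics]  # [[1,0,0], [0,1,0], [0,0,1], ...]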

So, for the raw data above, if you wanted to predict Political affiliation, and you used order of magnitude normalization, the normalized and encoded data would be:

Age   Income  Education  Sex  Political
0.24  0.2400  1 0 0       1    1 0 0
0.62  0.8200  0 1 0      -1    0 1 0
0.38  0.6400  0 0 1      -1    0 0 1
0.30  0.4000  0 1 0       1    1 0 0
0.45  0.4200  0 0 1       1    0 1 0
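
Putting all the pieces together, here is a minimal Python sketch that produces the matrix above from the raw data. It's a bare-bones illustration; in a realistic scenario you'd read the raw data from a text file rather than hard-code it:

edu_map = {"Low": [1, 0, 0], "Medium": [0, 1, 0], "High": [0, 0, 1]}
sex_map = {"M": -1, "F": +1}
pol_map = {"Democrat": [1, 0, 0], "Republican": [0, 1, 0], "Other": [0, 0, 1]}

raw = [
    (24, 24000, "Low",    "F", "Democrat"),
    (62, 82000, "Medium", "M", "Republican"),
    (38, 64000, "High",   "M", "Other"),
    (30, 40000, "Medium", "F", "Democrat"),
    (45, 42000, "High",   "F", "Republican"),
]

data = []
for (age, income, edu, sex, pol) in raw:
    row = [age / 100, income / 100_000]  # order of magnitude normalization
    row += edu_map[edu]                  # one-hot predictor encoding
    row.append(sex_map[sex])             # minus-one plus-one encoding
    row += pol_map[pol]                  # one-hot target encoding
    data.append(row)

for row in data:
    print(row)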

Unfortunately, there are many exceptions and special cases, but the guidelines here account for the most common scenarios. They work for logistic regression as well as for neural networks.



Three images from an Internet search for “normal neural network art”.
