I was giving a workshop on machine learning recently. There were a lot of questions about data normalization and encoding. A complete explanation would take approximately 20 pages, but here is a super-short summary of what to do.
Suppose you have this raw data:
Age  Income      Education  Sex  Political
24   $24,000.00  Low        F    Democrat
62   $82,000.00  Medium     M    Republican
38   $64,000.00  High       M    Other
30   $40,000.00  Medium     F    Democrat
45   $42,000.00  High       F    Republican
Your goal is to predict Political affiliation from Age, Income, Education, and Sex.
1. Normalize the numeric predictor values so that they’re mostly in the same range (for example, 0.0 to 1.0) using one of three techniques. This prevents predictors with large magnitudes, such as Income, from overwhelming predictors with small magnitudes, such as Age.
a. Using order of magnitude normalization – divide each predictor value by a constant. For example, you could divide all Age values by 100 and all Income values by 100,000 giving:
Age   Income
0.24  0.2400
0.62  0.8200
0.38  0.6400
0.30  0.4000
0.45  0.4200
b. Using min-max normalization – each value x becomes x’ = (x – min) / (max – min). For example, for Age, the min value is 24 and the max value is 62. So the Age value x = 38 becomes (38 – 24) / (62 – 24) = 0.37 and so on:
Age   Income
0.00  0.00
1.00  1.00
0.37  0.69
0.16  0.28
0.55  0.31
c. Using z-score normalization – each value x becomes x’ = (x – mean) / sd, where sd is the population standard deviation. For example, the mean of the Age values is 39.80 and the sd is 13.18. So the Age value x = 38 becomes (38 – 39.80) / 13.18 = -0.14 and so on:
Age    Income
-1.20  -1.30
 1.68   1.56
-0.14   0.67
-0.74  -0.51
 0.39  -0.41
Of these three techniques, I prefer order of magnitude normalization. It’s the easiest to apply and, in my experience, works better than min-max or z-score normalization. A short Python sketch of all three techniques follows.
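Here is a minimal Python sketch of all three techniques applied to the Age column of the demo data. The variable names and the hard-coded divisor of 100 are my own illustrative choices:

# normalize the Age column of the demo data three ways
ages = [24, 62, 38, 30, 45]

# a. order of magnitude: divide by a fixed constant (100 for Age)
om = [x / 100 for x in ages]

# b. min-max: (x - min) / (max - min)
lo, hi = min(ages), max(ages)
mm = [(x - lo) / (hi - lo) for x in ages]

# c. z-score: (x - mean) / sd, using the population sd
n = len(ages)
mean = sum(ages) / n
sd = (sum((x - mean) ** 2 for x in ages) / n) ** 0.5
zs = [(x - mean) / sd for x in ages]

print([round(x, 2) for x in om])  # [0.24, 0.62, 0.38, 0.3, 0.45]
print([round(x, 2) for x in mm])  # [0.0, 1.0, 0.37, 0.16, 0.55]
print([round(x, 2) for x in zs])  # [-1.2, 1.68, -0.14, -0.74, 0.39]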
2. Convert non-numeric predictor values by using one-hot encoding (for three or more possible values) or minus-one plus-one encoding (for Boolean / binary / two possible values).
For example, the Education predictor variable can be Low, Medium, or High. Let Low = (1, 0, 0), Medium = (0, 1, 0), High = (0, 0, 1) giving:
Education
1 0 0
0 1 0
0 0 1
0 1 0
0 0 1
For the Sex predictor, let M = -1 and F = +1 giving:
Sex
 1
-1
-1
 1
 1
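Here is a minimal sketch of both predictor encodings. The lookup-dictionary names (EDU, SEX) are illustrative, not part of any library:

# one-hot encoding for a predictor with three possible values
EDU = {"Low": (1, 0, 0), "Medium": (0, 1, 0), "High": (0, 0, 1)}
# minus-one plus-one encoding for a binary predictor
SEX = {"M": -1, "F": 1}

education = ["Low", "Medium", "High", "Medium", "High"]
sex = ["F", "M", "M", "F", "F"]

edu_enc = [EDU[e] for e in education]  # [(1, 0, 0), (0, 1, 0), (0, 0, 1), ...]
sex_enc = [SEX[s] for s in sex]        # [1, -1, -1, 1, 1]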
3. For a classification problem, encode the values to predict using one-hot encoding (for three or more possible values) or 0-1 encoding (for two possible values). To predict Political affiliation let Democrat = (1, 0, 0), Republican = (0, 1, 0), Other = (0, 0, 1) giving:
Political
1 0 0
0 1 0
0 0 1
1 0 0
0 1 0
If you were trying to predict Sex instead of Political affiliation, let M = 0 and F = 1 giving:
Sex
1
0
0
1
1
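And a corresponding sketch for the values to predict (again, the dictionary names are my own):

# one-hot encoding for a target with three possible classes
POL = {"Democrat": (1, 0, 0), "Republican": (0, 1, 0), "Other": (0, 0, 1)}
political = ["Democrat", "Republican", "Other", "Democrat", "Republican"]
pol_enc = [POL[p] for p in political]  # [(1, 0, 0), (0, 1, 0), (0, 0, 1), ...]

# 0-1 encoding for a binary target
SEX01 = {"M": 0, "F": 1}
sex = ["F", "M", "M", "F", "F"]
sex_enc = [SEX01[s] for s in sex]      # [1, 0, 0, 1, 1]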
So, for the raw data above, if you wanted to predict Political affiliation and you used order of magnitude normalization, the raw data and the corresponding normalized and encoded data would be:

Age  Income      Education  Sex  Political
24   $24,000.00  Low        F    Democrat
62   $82,000.00  Medium     M    Republican
38   $64,000.00  High       M    Other
30   $40,000.00  Medium     F    Democrat
45   $42,000.00  High       F    Republican

Age   Income  Education  Sex  Political
0.24  0.2400  1 0 0       1   1 0 0
0.62  0.8200  0 1 0      -1   0 1 0
0.38  0.6400  0 0 1      -1   0 0 1
0.30  0.4000  0 1 0       1   1 0 0
0.45  0.4200  0 0 1       1   0 1 0
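Putting the steps together, here is a minimal end-to-end sketch that reproduces the final table. It’s plain Python with illustrative names, not a particular library’s preprocessing API:

raw = [
    (24, 24000, "Low",    "F", "Democrat"),
    (62, 82000, "Medium", "M", "Republican"),
    (38, 64000, "High",   "M", "Other"),
    (30, 40000, "Medium", "F", "Democrat"),
    (45, 42000, "High",   "F", "Republican"),
]

EDU = {"Low": (1, 0, 0), "Medium": (0, 1, 0), "High": (0, 0, 1)}
SEX = {"M": -1, "F": 1}
POL = {"Democrat": (1, 0, 0), "Republican": (0, 1, 0), "Other": (0, 0, 1)}

encoded = []
for age, income, edu, sex, pol in raw:
    # order of magnitude normalization for numerics, then encoding
    row = (age / 100, income / 100_000) + EDU[edu] + (SEX[sex],) + POL[pol]
    encoded.append(row)

for row in encoded:
    print(row)
# (0.24, 0.24, 1, 0, 0, 1, 1, 0, 0)
# (0.62, 0.82, 0, 1, 0, -1, 0, 1, 0)
# ... and so on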
Unfortunately, there are many exceptions and special cases, but the guidelines here account for the most common scenarios. They apply to logistic regression as well as to neural networks.
[Three images from an Internet search for “normal neural network art”.]