Suppose you have some data that looks like this:
Age  Sex  Income   Politics
===========================
25   +1   28,000   (0 0 1) = liberal
48   -1   99,000   (1 0 0) = conservative
36   -1   62,000   (0 1 0) = moderate
52   +1   34,000   (0 0 1)
etc.
Your goal is to predict a person’s political leaning (conservative, moderate, liberal) from their Age, Sex (-1 = male, +1 = female), and annual Income.
The two most common machine learning approaches would be multi-class logistic regression and neural network classification. With either technique, you don't absolutely need to normalize your data, but doing so will probably give you a better predictive model, because in the un-normalized data the income values are much, much larger than the age values, which in turn are larger than the sex values.
In the literature, by far the two most common forms of normalization are standard score and max-min normalization (both techniques have many alternative names).
In standard score normalization you calculate the mean (m) and the standard deviation (sd) of each predictor column, and then normalize each value x as (x - m) / sd. All normalized values will be roughly between -10.0 and +10.0 except in really unusual cases.
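As a minimal sketch, here is standard score normalization applied to the Age column from the sample data above (the function name is my own, not from any library):

```python
def standard_score(column):
    # Compute the column mean and (population) standard deviation,
    # then normalize each value as (x - m) / sd.
    n = len(column)
    m = sum(column) / n
    var = sum((x - m) ** 2 for x in column) / n
    sd = var ** 0.5
    return [(x - m) / sd for x in column]

ages = [25, 48, 36, 52]
norm_ages = standard_score(ages)
# The normalized ages are small values centered on 0.0,
# with mean 0.0 and standard deviation 1.0.
```

After normalization every predictor column has mean 0 and standard deviation 1, so no column dominates the others purely because of its scale.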
In max-min normalization you calculate the max and min of each predictor column, then normalize as (x - min) / (max - min). All values will be between 0.0 and 1.0.
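The max-min technique is even shorter. A sketch, again using the sample Age values (the function name is mine):

```python
def max_min(column):
    # Normalize each value as (x - min) / (max - min),
    # which maps the smallest value to 0.0 and the largest to 1.0.
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

ages = [25, 48, 36, 52]
norm_ages = max_min(ages)
# The youngest age (25) maps to 0.0, the oldest (52) maps to 1.0,
# and everything else lands strictly between them.
```

One caveat worth noting: if a column has constant values, max - min is zero and the division fails, so in practice you'd guard against that case.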
I sometimes use a quick and dirty normalization technique I call order of magnitude normalization. For the Age column, I’d divide all values by 10. I’d leave the Sex values alone. I’d divide all Income values by 10,000. I’ve never read about this technique, but there must be research on it somewhere because it’s simple, obvious, and in my experience, pretty effective.