Encoding Non-Numeric Data for Statistics vs. for Machine Learning

In both classical statistics (linear regression and ANOVA) and machine learning (logistic regression and neural networks), you must transform non-numeric data into numbers. However, the encoding techniques used in the two fields are different, yet superficially similar. This different-but-similar situation can cause confusion for people moving from traditional data science to ML, and for ML people who find information about statistics encoding on the Internet when they’re looking for ML encoding.

Briefly: The three most common statistics encoding techniques are called dummy coding, effect coding, and orthogonal coding. The (single) most common encoding technique for ML is called 1-of-(N-1) encoding. The statistics effect coding and the ML 1-of-(N-1) encoding are the same even though their motivating concepts are different.

Suppose you have a categorical variable that can take one of four values: red, yellow, blue, green.

Statistics dummy coding is:

red    = (1, 0, 0)
yellow = (0, 1, 0)
blue   = (0, 0, 1)
------------------
green  = (0, 0, 0)  (the reference value; sometimes its row is simply omitted)
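
In code, dummy coding can be implemented as a simple lookup table. Here is a minimal Python sketch (the variable names are just for illustration):

# map each category to its dummy-coded vector; green is the reference value
dummy_map = {
    "red":    (1, 0, 0),
    "yellow": (0, 1, 0),
    "blue":   (0, 0, 1),
    "green":  (0, 0, 0),
}

colors = ["green", "red", "blue"]
encoded = [dummy_map[c] for c in colors]
print(encoded)  # [(0, 0, 0), (1, 0, 0), (0, 0, 1)]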

Statistics effect coding is:

red    = ( 1,  0,  0)
yellow = ( 0,  1,  0)
blue   = ( 0,  0,  1)
green  = (-1, -1, -1)

Statistics orthogonal coding is:

red    = (-3,  1, -1)
yellow = (-1, -1,  3)
blue   = ( 1, -1, -3)
green  = ( 3,  1,  1)

Machine Learning 1-of-(N-1) encoding is:

red    = ( 1,  0,  0)
yellow = ( 0,  1,  0)
blue   = ( 0,  0,  1)
green  = (-1, -1, -1)

Notice that in all cases, to encode a categorical variable that can take four values, you use three encoding variables. In general, to encode a variable that can take one of n values, you use n-1 encoding variables. The statistics effect coding is the same as ML 1-of-(N-1) encoding.
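
To make the n-1 rule concrete, here is a minimal Python sketch that builds the effect coding / 1-of-(N-1) mapping for an arbitrary list of category values. It assumes the last listed value is the one that gets all -1s:

def one_of_n_minus_1(values):
    # build the effect coding / 1-of-(N-1) map for n category values
    n = len(values)
    mapping = {}
    for i, v in enumerate(values):
        if i < n - 1:
            vec = [0] * (n - 1)
            vec[i] = 1            # one 1, the rest 0s
        else:
            vec = [-1] * (n - 1)  # the last value gets all -1s
        mapping[v] = tuple(vec)
    return mapping

print(one_of_n_minus_1(["red", "yellow", "blue", "green"]))
# {'red': (1, 0, 0), 'yellow': (0, 1, 0), 'blue': (0, 0, 1), 'green': (-1, -1, -1)}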

For statistics, you typically use dummy coding, except when you want to analyze interaction effects (then you use effect coding) or when you want to look at the contrasts between values (then you use orthogonal coding). There are many possible orthogonal encodings for a given n.

Orthogonal coding is a bit tricky. Notice that the sum of each “encoding column” is zero:

red    = (-3,  1, -1)
yellow = (-1, -1,  3)
blue   = ( 1, -1, -3)
green  = ( 3,  1,  1)
=====================
sum    =   0   0   0

Also, the dot product of each pair of “encoding column vectors” is 0:

col_1 dot col_2 =
(-3 * 1) + (-1 * -1) + (1 * -1) + (3 * 1) =
-3 + 1 + -1 + 3 = 0

col_1 dot col_3 =
(-3 * -1) + (-1 * 3) + (1 * -3) + (3 * 1) =
3 + -3 + -3 + 3 = 0

col_2 dot col_3 =
(1 * -1) + (-1 * 3) + (-1 * -3) + (1 * 1) =
-1 + -3 + 3 + 1 = 0

When the dot product of two vectors is 0, the vectors are called orthogonal.
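
Both properties are easy to verify programmatically. A quick check using NumPy, where the matrix rows are the four encoded values shown above:

import numpy as np

# rows are red, yellow, blue, green; columns are the three encoding variables
M = np.array([[-3,  1, -1],
              [-1, -1,  3],
              [ 1, -1, -3],
              [ 3,  1,  1]])

print(M.sum(axis=0))  # column sums: [0 0 0]
print(M.T @ M)        # off-diagonal entries are all 0, so the columns are orthogonal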

A special case for each type of encoding is when the categorical variable is binary, for example, sex, which can be male or female. For statistics dummy coding, use male = 0 and female = 1. For all the other encoding schemes, use male = -1 and female = +1.
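
In code, the binary case is just a two-entry lookup table, for example:

# statistics dummy coding for a binary variable
dummy_sex  = {"male": 0, "female": 1}

# effect coding / ML 1-of-(N-1) coding for a binary variable
effect_sex = {"male": -1, "female": +1}

print(dummy_sex["female"], effect_sex["male"])  # 1 -1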
