In spite of decades of research on neural networks, there are still some fundamental concepts that are not well-understood in the data science community. One of these concepts is how you should encode binary predictors. For example, if you’re trying to predict the income of a person from several variables such as sex, age, state of residence, and so on, should you encode male = 0 and female = 1, or male = -1 and female = +1?
The answer is that in most cases it doesn't matter, but in practice using -1 and +1 is preferable. Briefly, if you encode sex as 0 and 1, then whenever an input node is 0, the associated weights to the hidden nodes are irrelevant because you're multiplying them by 0, so all the discriminatory power must come from the connecting hidden nodes' bias values. If you encode as -1 and +1, the associated weights always contribute directly to the value of the connecting hidden node.
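To make the argument concrete, here is a tiny numeric sketch. The weight and bias values are made up purely for illustration; it just shows how a 0-encoded input erases its weights from a hidden node's weighted sum, while a -1 encoding still pushes through them:

```python
# hypothetical weight and bias values, for illustration only
w_sex, w_age = 0.75, -0.20   # weights from the sex and age inputs to one hidden node
b = 0.10                     # hidden node bias

age = 0.38  # a normalized age value

# 0/1 encoding: a male item (sex = 0) wipes out w_sex entirely
pre_act_01 = (0 * w_sex) + (age * w_age) + b      # 0.024 -- w_sex plays no role

# -1/+1 encoding: a male item (sex = -1) still pushes the sum through w_sex
pre_act_pm1 = (-1 * w_sex) + (age * w_age) + b    # -0.726 -- w_sex contributes

print(pre_act_01, pre_act_pm1)
```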
A closely related argument is that if you use 0 and 1 encoding, you only have a gap of 1.0 to work with. If you use -1 and +1 encoding, you have a gap of 2.0 to work with. In theory this doesn’t matter because if you train the neural network long enough, the associated weights will adapt to give good values.
Unexpectedly, for my demo problem (described below), encoding sex with 0 and 1 gave a slightly better result than encoding with -1 and +1. I speculate that with -1 and +1 you have to train a bit longer to get equivalent results because there are, in effect, more active weights to adjust.
Some people think that you can't use 0 and 1 encoding for binary predictors at all. This notion comes from a famous old Internet neural network FAQ, first published by the SAS company in the 1990s. The FAQ was, and still is, a very good resource. However, one section of the FAQ gives an example where encoding binary predictors as 0 and 1 doesn't work well at all. The example problem is the 5-bit parity problem, where the inputs are 5 bits like (1 0 1 0 1) and the target value to predict is the parity, which is 0 or 1. For this problem, using -1 and +1 for the predictors works much, much better than using 0 and 1. However, this example is pathological because all the predictors are binary, and the dependent value to predict is binary too.
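The FAQ's original code isn't reproduced here, but a quick sketch (my own construction, not the FAQ's) of what the 5-bit parity training data looks like under the two encodings is:

```python
import itertools

# all 32 five-bit input patterns, e.g. (1, 0, 1, 0, 1)
patterns = list(itertools.product([0, 1], repeat=5))

# target parity: 1 if the number of 1-bits is odd, else 0
targets = [sum(p) % 2 for p in patterns]

# 0/1 encoding: use the bits as-is
data_01 = [(list(p), t) for p, t in zip(patterns, targets)]

# -1/+1 encoding: map each 0 bit to -1
data_pm1 = [([2 * bit - 1 for bit in p], t) for p, t in zip(patterns, targets)]

print(data_01[21])   # ([1, 0, 1, 0, 1], 1)
print(data_pm1[21])  # ([1, -1, 1, -1, 1], 1)
```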
So, the bottom line is that, except for some pathological examples, you can encode binary predictors either way, but using -1 and +1 encoding is more principled and often better in practice.
I coded up an example using PyTorch. The goal is to predict a person's political leaning (conservative, moderate, liberal) from sex, age, region (eastern, western, central), and income. I created a simple 6-10-3 classifier (six inputs because region is one-hot encoded as three values). Then I trained two different neural networks. The first network was trained using a data file where sex is encoded male = 0, female = 1. The second network was trained on the same data except that the encoding was male = -1, female = +1. I was mildly surprised when the theoretically inferior 0 and 1 encoding gave a slightly better model than the -1 and +1 encoding. This one example doesn't prove anything, of course, but it does suggest that, in some scenarios at least, how you encode binary predictor variables isn't a critical factor.
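The full demo code is too long to show here, but a condensed sketch of the kind of 6-10-3 PyTorch network involved, along with one data item under each sex encoding, looks something like this. The tanh hidden activation, learning rate, and the specific data values are illustrative assumptions, not necessarily what the full demo used:

```python
import torch as T

class Net(T.nn.Module):
    # 6-10-3 classifier: (sex, age, region x3 one-hot, income) -> (cons, mod, lib)
    def __init__(self):
        super().__init__()
        self.hid = T.nn.Linear(6, 10)
        self.out = T.nn.Linear(10, 3)

    def forward(self, x):
        z = T.tanh(self.hid(x))   # tanh hidden activation (an assumption)
        return self.out(z)        # raw logits; CrossEntropyLoss applies softmax

net = Net()
loss_fn = T.nn.CrossEntropyLoss()
opt = T.optim.SGD(net.parameters(), lr=0.01)

# one male data item under the two sex encodings; other values are made up
# fields: sex, normalized age, region one-hot (central), normalized income
x_01  = T.tensor([[ 0.0, 0.32, 0.0, 0.0, 1.0, 0.465]])   # male = 0
x_pm1 = T.tensor([[-1.0, 0.32, 0.0, 0.0, 1.0, 0.465]])   # male = -1
y = T.tensor([1])  # target class index: moderate

# one training step on the 0/1-encoded item; the real experiment trained
# two separate networks, one per encoding, and compared their accuracies
opt.zero_grad()
loss = loss_fn(net(x_01), y)
loss.backward()
opt.step()
```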
Two models by artist Daniel Agdag. The models are made entirely from cardboard. Unexpected and nice.