Encoding Binary Predictor Variables for Neural Networks

In spite of decades of research on neural networks, there are still many fundamental ideas that are not well understood. One such topic is encoding binary predictors (also called binary features).

The three most common ways to encode a binary predictor variable, such as the sex of a person, are one-hot encoding, 0-1 encoding, and minus-one-plus-one encoding. In practice, the choice of encoding technique doesn’t make a big difference.
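To make the three schemes concrete, here is a minimal sketch in Python. The helper names and the particular numeric assignments (male = (1, 0) and so on, matching the examples later in this post) are just illustrative choices:

# Three common ways to encode a binary predictor such as sex.
# The numeric assignments (male first, female second) are arbitrary
# illustrative choices.

def one_hot(sex):
    # two values per variable: male = (1, 0), female = (0, 1)
    return (1.0, 0.0) if sex == "male" else (0.0, 1.0)

def zero_one(sex):
    # one value per variable: male = 1, female = 0
    return 1.0 if sex == "male" else 0.0

def minus_plus(sex):
    # one value per variable: male = -1, female = +1
    return -1.0 if sex == "male" else 1.0

print(one_hot("male"), one_hot("female"))        # (1.0, 0.0) (0.0, 1.0)
print(zero_one("male"), zero_one("female"))      # 1.0 0.0
print(minus_plus("male"), minus_plus("female"))  # -1.0 1.0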

Suppose you are using sex as a predictor variable. If you use one-hot encoding, you could set male = (1, 0) and female = (0, 1). Suppose the optimal weights and bias, which would be learned during training, are w1 = 0.60, w2 = 0.30, b = 0.10. With one-hot encoding an input value of male contributes a hidden node value of (1 * 0.60) + (0 * 0.30) + 0.10 = 0.70 and an input value of female contributes (0 * 0.60) + (1 * 0.30) + 0.10 = 0.40.

Suppose you use 0-1 encoding with male = 1 and female = 0. If trained optimally, the neural network would find w = 0.30 and b = 0.40. An input of male generates (1 * 0.30) + 0.40 = 0.70 and an input of female generates (0 * 0.30) + 0.40 = 0.40, which are the same hidden node values as with one-hot encoding.

Suppose you use minus-one-plus-one encoding with male = -1 and female = +1. If trained optimally, the neural network would find w = -0.15 and b = 0.55. An input of male generates (-1 * -0.15) + 0.55 = 0.70 and an input of female generates (+1 * -0.15) + 0.55 = 0.40, which again are the same hidden node values.
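The little script below is a sketch that plugs in the exact weight and bias values from the three examples above and confirms that each encoding produces the same hidden node pre-activation values of 0.70 for male and 0.40 for female:

# Check that the three encodings, with the weight and bias values from
# the text, produce the same hidden node pre-activation values:
# 0.70 for male and 0.40 for female. Rounding is only to keep the
# printed output tidy.

# one-hot: male = (1, 0), female = (0, 1); w1 = 0.60, w2 = 0.30, b = 0.10
for label, (x1, x2) in [("male", (1, 0)), ("female", (0, 1))]:
    print("one-hot    ", label, round(x1 * 0.60 + x2 * 0.30 + 0.10, 2))

# 0-1: male = 1, female = 0; w = 0.30, b = 0.40
for label, x in [("male", 1), ("female", 0)]:
    print("0-1        ", label, round(x * 0.30 + 0.40, 2))

# minus-one-plus-one: male = -1, female = +1; w = -0.15, b = 0.55
for label, x in [("male", -1), ("female", +1)]:
    print("minus-plus ", label, round(x * -0.15 + 0.55, 2))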

The main difference between the three encoding techniques is that one-hot encoding requires two weights per binary variable, while 0-1 encoding and minus-one-plus-one encoding require only one weight per binary variable. Therefore, one-hot encoding requires a tiny bit of extra effort during training.

In theory, minus-one-plus-one encoding should be a tiny bit easier to train than 0-1 encoding. The gradient for an input-to-hidden weight includes the associated input value as a factor. With minus-one-plus-one encoding the input is never zero, so every training item nudges the weight, while with 0-1 encoding the items encoded as 0 produce no update at all for that weight. Therefore, on average, the gradient will be larger, the weight update delta will be larger, and the weight value will converge more quickly. But this is really splitting hairs. Based on my experience, the difference between the three encoding techniques is not significant in a practical sense.
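One rough way to see the argument, under the simplifying assumption of a single linear node trained with squared error (a sketch, not the full network math), is that the weight gradient is the error term times the input value, so an input of 0 contributes nothing to that weight's update:

# Sketch: gradient of squared error for a single linear node y = w*x + b.
# dL/dw = 2 * (y - t) * x, so the weight gradient scales with the input x.
# With 0-1 encoding, items where x = 0 give a zero gradient for w;
# with -1/+1 encoding, every item has input magnitude 1.

def weight_grad(w, b, x, t):
    y = w * x + b                # computed output
    return 2.0 * (y - t) * x     # gradient of (y - t)^2 with respect to w

# arbitrary illustrative weight, bias, and target values
w, b, target = 0.05, 0.10, 0.70

print(weight_grad(w, b, 0.0, target))   # 0-1 encoding, female = 0  -> gradient 0.0
print(weight_grad(w, b, 1.0, target))   # 0-1 encoding, male = 1    -> nonzero gradient
print(weight_grad(w, b, -1.0, target))  # -1/+1 encoding, male = -1 -> nonzero gradient
print(weight_grad(w, b, 1.0, target))   # -1/+1 encoding, female = +1 -> nonzero gradient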

A minor engineering advantage of using one-hot encoding is that, if you are already using one-hot encoding for categorical predictor variables that have three or more possible values, you only need one type of encoding.
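For example, a single one-hot helper can handle both a binary predictor and a multi-valued categorical predictor. The three-valued job type variable and its category list below are hypothetical, just to illustrate the single-encoder idea:

# A single one-hot helper used for both a binary predictor and a
# multi-valued categorical predictor. The job type variable and its
# categories are hypothetical.

def one_hot_encode(value, categories):
    # 1.0 in the position of value, 0.0 everywhere else
    return tuple(1.0 if value == c else 0.0 for c in categories)

sex_categories = ["male", "female"]                  # binary variable
job_categories = ["clerical", "technical", "sales"]  # hypothetical 3-value variable

print(one_hot_encode("female", sex_categories))  # (0.0, 1.0)
print(one_hot_encode("sales", job_categories))   # (0.0, 0.0, 1.0)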

The bottom line is that there is no significant advantage for any of the three most common encoding techniques for binary predictor variables.


There are many possible predictor variables related to people. For example, age, height, race, income, and so on. Sex is one of the few inherently binary variables that I can think of.
