I had an interesting exchange of e-mail messages with some of my colleagues recently. The topic was the number of neural network output nodes/neurons to use for a classification problem. For example, suppose the problem at hand is to classify/predict a person’s sex (either male or female) based on some set of input data such as height, political party affiliation and so on. You could create a neural network with two output nodes, one for each sex, and use softmax activation so that the two output values sum to 1.0 and can be interpreted as probabilities. For example, if output node 0 represents male and output node 1 represents female, and the neural network output values are 0.75 and 0.25, you’d conclude the person is male.
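The two-output-node idea can be sketched in a few lines of Python. The raw node values below are hypothetical, chosen so the softmax result matches the 0.75/0.25 example:

```python
import math

def softmax(zs):
    # Subtract the max value for numerical stability before exponentiating.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical pre-activation sums for a two-output-node network.
raw = [math.log(3.0), 0.0]
probs = softmax(raw)          # [0.75, 0.25]; sums to 1.0

# Node 0 = male, node 1 = female; the larger probability wins.
predicted = "male" if probs[0] > probs[1] else "female"
```

Because the exponentials are all positive and divided by their sum, the outputs are guaranteed to sum to 1.0 no matter what the raw node values are.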

On the other hand, you could create a neural network with just one output node, and use log-sigmoid activation so that the single output, which will be between 0 and 1, represents the probability of just one of the two classes (say, male). So if the output were 0.44, you’d infer the probability of male is 0.44, and therefore the probability of female is 1 – 0.44 = 0.56, and conclude the person is female.
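The one-output-node design looks like this in Python. The raw input value is hypothetical, picked so the sigmoid output lands near the 0.44 in the example:

```python
import math

def log_sigmoid(z):
    # Logistic sigmoid: maps any real value into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-activation sum for a single-output-node network,
# where the output is interpreted as P(male).
p_male = log_sigmoid(-0.2412)     # approximately 0.44
p_female = 1.0 - p_male           # approximately 0.56

predicted = "male" if p_male >= 0.5 else "female"
```

Note that the second probability is never produced by the network itself; it is inferred as 1 minus the single output.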

Which approach is better? Either approach will, theoretically, give the same result. The two-output-node design will, again in theory, be harder to train because there are additional weights and an extra bias value. However, and this is the main point of my post here, if you want to use weight decay, you should use the two-output-node design. Weight decay is a technique that penalizes large weights by adding a penalty term, proportional to the magnitudes of the weights, to the error term during training. When using weight decay, it seems reasonable to want explicit weights for all classes; but there is no research I’m aware of that investigates this idea. By the way, the point of weight decay is to limit over-fitting — over-fitting often generates large weight values, so avoiding large weight values will in principle help prevent over-fitting.
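A minimal sketch of the weight decay idea, assuming the common L2 form where the penalty is a constant times the sum of the squared weight values (the base error, weights, and decay constant below are all hypothetical):

```python
def decayed_error(base_error, weights, lam):
    # Weight decay adds a penalty, proportional to the sum of the
    # squared weights, to the base error; lam controls its strength.
    penalty = lam * sum(w * w for w in weights)
    return base_error + penalty

# Same base error, but the larger weights incur a much larger penalty.
small = decayed_error(0.10, [0.5, -0.3, 0.2], lam=0.01)  # 0.1038
large = decayed_error(0.10, [5.0, -3.0, 2.0], lam=0.01)  # 0.48
```

Because training minimizes the penalized error, solutions with smaller weight values are preferred, which is the mechanism that discourages over-fitting.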

What about classification with more than two output classes? Suppose you are trying to classify/predict political party affiliation where there are four possible values: democrat, republican, independent, other. The principle is the same. You could design a neural network with four output nodes and use softmax activation, or you could use just three output nodes combined with log-sigmoid activation, where the probability of the fourth class is inferred as 1.0 minus the sum of the three output values.
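The four-class case can be sketched the same way. All the raw node values below are hypothetical; in the three-output design they are chosen so the three sigmoid outputs sum to less than 1, which independent sigmoid outputs do not guarantee in general:

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def log_sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

parties = ["democrat", "republican", "independent", "other"]

# Four-output design: softmax gives one explicit probability per class.
probs4 = softmax([1.5, 0.8, 0.1, -0.5])

# Three-output design: three sigmoid outputs for the first three
# classes; the fourth probability is inferred, not produced directly.
probs3 = [log_sigmoid(-1.0), log_sigmoid(-1.5), log_sigmoid(-2.0)]
p_other = 1.0 - sum(probs3)
```

In the four-output design every class has its own output node (and its own weights), which is exactly what makes it the natural fit for weight decay.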

My personal preference is to use the same number of output nodes as there are dependent variable classes. I think the extra training effort required is a relatively small price to pay in return for being able to use weight decay when I want, and for being able to directly interpret the output values as probabilities. Most of my colleagues disagree and prefer the n-1 output node design.