The Differences Between Neural Multiclass Classification, Regression, and Binary Classification

There are three basic kinds of neural network prediction problems: multiclass classification, regression, and binary classification. There are also many more sophisticated kinds of neural problems, such as image classification using a CNN, text analysis using an LSTM, and so on.

In multiclass classification you predict a variable that can be one of three or more categorical values, for example, predicting a person’s political leaning (conservative, moderate, liberal) from their age, annual income, sex, and education level.

In regression, you predict a single numeric value, for example, predicting annual income from political leaning, age, sex, and education level.

In binary classification, you predict a variable that can be just one of two possible categorical values, for example, predicting sex (male, female) from political leaning, age, annual income, and education level.

A hyperparameter is any value that you are free to choose, for example, what activation function to use on the hidden layer nodes, or the maximum number of training iterations. Some design choices, however, are largely dictated by the type of problem. The three basic types of neural networks differ in three key characteristics: number of output nodes, output layer activation, and loss function for training.

For the number of output nodes, multiclass classification uses one output node for each possible value of the variable to predict, for example, three output nodes if there are three possible values. Both regression and binary classification use one output node.

For output layer activation, multiclass classification uses softmax but with one huge warning: many neural network libraries have a categorical cross entropy loss function that automatically applies softmax during training. This means that when you are training the network, you do not explicitly apply softmax, but after the model has been trained, when making a prediction, you should explicitly apply softmax if you want the prediction values to sum to 1.0 so they can be interpreted as probabilities.
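To make the prediction-time step concrete, here is a minimal sketch of applying softmax by hand to the raw output node values of a trained multiclass network. The function name and the three raw output values are made up for illustration; they do not come from any particular library or model.

```python
import math

def softmax(logits):
    """Scale raw output node values so they are positive and sum to 1.0."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw output values for (conservative, moderate, liberal).
logits = [1.5, 0.2, -0.8]

probs = softmax(logits)
predicted_class = probs.index(max(probs))  # index of the largest probability

print("pseudo-probabilities:", probs)
print("sum:", sum(probs))
print("predicted class index:", predicted_class)
```

Because softmax preserves the ordering of the values, the predicted class is the same with or without it. The explicit call only matters when you want to read the outputs as probabilities.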

For regression output node activation, you do not use any function, or equivalently, you can say you’re using the identity function f(x) = x. For binary classification output node activation, you use logistic sigmoid activation so that the output value is between 0.0 and 1.0, where a value less than 0.5 maps to class 0 and a value greater than 0.5 maps to class 1.
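The two single-node cases can be sketched in a few lines. This is an illustration only, with made-up function names. One assumption to note: the text does not say what to do when the output is exactly 0.5, so this sketch arbitrarily assigns that case to class 0.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written so exp() never receives a large positive value."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_binary(raw_output):
    """Return (class, p) for a single-output-node binary classifier."""
    p = sigmoid(raw_output)
    # Assumption: an output of exactly 0.5 is assigned to class 0.
    predicted = 1 if p > 0.5 else 0
    return predicted, p

def predict_regression(raw_output):
    """Identity activation: the raw output node value is the prediction."""
    return raw_output

print(predict_binary(2.0))      # p is greater than 0.5, so class 1
print(predict_binary(-2.0))     # p is less than 0.5, so class 0
print(predict_regression(3.7))  # unchanged
```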

For the training loss function, for multiclass classification, you usually use categorical cross entropy (although mean squared error is perfectly acceptable). For regression you use mean squared error. For binary classification you use binary cross entropy (but again, mean squared error is fine).
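As a rough sketch of what these loss functions compute, here are plain-Python versions for a single training item (cross entropy) and for a small batch (mean squared error). The function names and all numeric values are invented for illustration, and real libraries typically add safeguards, such as clamping probabilities away from 0.0, that are omitted here.

```python
import math

def categorical_cross_entropy(probs, target_index):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[target_index])

def binary_cross_entropy(p, target):
    """p is the sigmoid output; target is 0 or 1."""
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def mean_squared_error(predicteds, targets):
    """Average of the squared differences between predictions and targets."""
    n = len(predicteds)
    return sum((y - t) ** 2 for y, t in zip(predicteds, targets)) / n

# Multiclass: hypothetical softmax outputs, correct class is index 0.
cce = categorical_cross_entropy([0.7, 0.2, 0.1], 0)

# Binary: hypothetical sigmoid output 0.9, correct class is 1.
bce = binary_cross_entropy(0.9, 1)

# Regression: hypothetical normalized income predictions and targets.
mse = mean_squared_error([0.52, 0.61], [0.50, 0.65])

print(cce, bce, mse)
```

In all three cases a smaller value means the computed outputs are closer to the targets, which is what training tries to achieve.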

Alas, there are many, many minor details and exceptions to this comparison cheat sheet. For example, for binary classification, you can encode the variable to predict as (1, 0) and (0, 1) and then use categorical cross entropy instead of binary cross entropy. But the rules of thumb presented in this post deal with most situations.
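The two-node encoding for binary classification can be checked numerically. The sketch below assumes class 0 is encoded as (1, 0) and class 1 as (0, 1), and that the two output probabilities are (1 - p, p), where p is the probability of class 1. Under those assumptions, categorical cross entropy and binary cross entropy give the same loss value. The function names and the value of p are made up for illustration.

```python
import math

def binary_cross_entropy(p, target):
    """Single-node version: p is the probability of class 1, target is 0 or 1."""
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def categorical_cross_entropy(probs, one_hot):
    """Two-node version: one_hot is (1, 0) for class 0 or (0, 1) for class 1."""
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs))

p = 0.8                  # hypothetical probability of class 1
two_node = [1.0 - p, p]  # the same information spread over two output nodes

for target, one_hot in [(0, (1, 0)), (1, (0, 1))]:
    bce = binary_cross_entropy(p, target)
    cce = categorical_cross_entropy(two_node, one_hot)
    print(target, bce, cce, math.isclose(bce, cce))
```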

Rule of thumb – fingers – hand – helping hand – attendant – flight attendants. From left: Aeroflot (Russia), China Southern, Shenzhen (China), the famous (and rather creepy) VietJet stewardesses uniforms, a European airline, an American airline. I’m pretty sure that being a stewardess is a very difficult job – two of my good friends worked as flight attendants a few years ago.
