I’m working with a group of people whose goal is to create a prediction system that uses some kind of deep neural network. It shouldn’t surprise you that in order to understand deep neural networks, you have to understand ordinary, single-layer neural networks.
So, to establish a baseline set of knowledge, I gave a short talk about what neural networks are, and how they work. One emphasis of my talk was to explain how vocabulary in the field varies wildly. For example, suppose you want to predict the annual income of a person based on their age, years of education, sex, and so on. The predictor variables can be called “features”, “independent variables”, “attributes”, “X values”, or several other terms.
One detail I discussed was the softmax function. Suppose you are trying to predict the political leaning of a person and the possible values are (conservative, moderate, liberal). A neural network is essentially a complex math function that will generate three values like (2.0, 4.0, 3.0). These values are usually normalized so that they sum to 1.0 and then they can be interpreted as probabilities.
The softmax function for these three values does that:
2.0 -> e^2.0 / (e^2.0 + e^4.0 + e^3.0) = 0.09
4.0 -> e^4.0 / (e^2.0 + e^4.0 + e^3.0) = 0.67
3.0 -> e^3.0 / (e^2.0 + e^4.0 + e^3.0) = 0.24
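To make the arithmetic concrete, here is a minimal Python sketch of the same calculation. The function name `softmax` and the max-subtraction step are my own choices for illustration, not something from the talk; subtracting the largest value first is a common trick that leaves the result unchanged while avoiding overflow for large inputs.

```python
import math

def softmax(values):
    # Shifting by the max does not change the result, because the
    # common factor e^(-max) cancels in the ratio; it only guards
    # against overflow in math.exp for large inputs.
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 4.0, 3.0])
print([round(p, 2) for p in probs])  # [0.09, 0.67, 0.24]
```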
In this case, the largest probability is associated with the middle value, so the outputs map to (0, 1, 0) which means the prediction is the middle value, “moderate”. Most of the audience knew this. So I posed the question, “Why go to so much trouble to get numbers to sum to 1.0 when you can just divide each value by the sum of the values?”
2.0 -> 2.0 / (2.0 + 4.0 + 3.0) = 0.22
4.0 -> 4.0 / (2.0 + 4.0 + 3.0) = 0.44
3.0 -> 3.0 / (2.0 + 4.0 + 3.0) = 0.33
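For comparison, the divide-by-the-sum approach takes only a couple of lines. This is just a sketch; the name `normalize_by_sum` is hypothetical, and the code assumes the values do not sum to zero.

```python
def normalize_by_sum(values):
    # Simple normalization: divide each value by the sum of all values.
    # Assumes the sum is nonzero.
    total = sum(values)
    return [v / total for v in values]

ratios = normalize_by_sum([2.0, 4.0, 3.0])
print([round(r, 2) for r in ratios])  # [0.22, 0.44, 0.33]
```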
The answer is that in order to train a neural network using back-propagation, you need the calculus derivative of the normalizing function. The softmax function has a simple and beautiful derivative, but the second approach doesn’t have an easy derivative.
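I didn’t write the derivative out in the talk, but the standard result is that the partial derivative of softmax output i with respect to input j is s_i * (1 - s_i) when i equals j, and -s_i * s_j otherwise. Here is a rough Python sketch that builds that matrix of derivatives and checks it against a finite-difference estimate. The function names are my own for illustration, and the step size `eps` is an arbitrary choice.

```python
import math

def softmax(values):
    largest = max(values)  # shift for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(values):
    # Entry [i][j] is d(softmax_i) / d(input_j) = s_i * (delta_ij - s_j).
    s = softmax(values)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

def numeric_jacobian(values, eps=1e-6):
    # Central-difference estimate of the same derivatives, used only
    # as a sanity check on the closed-form version above.
    n = len(values)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        up = list(values)
        down = list(values)
        up[j] += eps
        down[j] -= eps
        s_up, s_down = softmax(up), softmax(down)
        for i in range(n):
            jac[i][j] = (s_up[i] - s_down[i]) / (2.0 * eps)
    return jac

analytic = softmax_jacobian([2.0, 4.0, 3.0])
numeric = numeric_jacobian([2.0, 4.0, 3.0])
max_gap = max(abs(a - b)
              for row_a, row_b in zip(analytic, numeric)
              for a, b in zip(row_a, row_b))
print(max_gap < 1e-7)  # True
```

Notice that the derivative needs only the softmax outputs themselves, which the network has already computed during the forward pass.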
The moral of the story is that, if you’re new to neural networks, you shouldn’t underestimate the confusion that varying vocabulary can cause. Neural networks also have layers and layers of details, some of which are important, and some of which are interesting but not critically important.