Suppose you are using a neural network to make a prediction where the thing-to-predict can be one of three possible values. For example, you might want to predict the political party affiliation of a person (democrat, republican, other) based on things like age, annual income, sex, and years of education.
A neural network classifier would accept four numeric inputs corresponding to age, income, sex, and education, generate a preliminary output of three values like (1.55, 2.30, 0.90), and then normalize the preliminary outputs so that they sum to 1.0 and can be interpreted as probabilities.
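As a minimal sketch of that idea (the weights, biases, and input values here are entirely hypothetical, and a real network would have one or more hidden layers), the raw, un-normalized output values might be computed like this:

```python
# Minimal sketch: a 4-input, 3-output linear layer producing raw
# (un-normalized) output values. All numbers are hypothetical.
def raw_outputs(inputs, weights, biases):
    # one raw score per output class: dot(inputs, row) + bias
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]

inputs = [0.30, 0.55, -1.0, 0.16]   # normalized age, income, sex, education
weights = [[1.0, 2.0, 0.5, 1.0],    # one weight row per output class
           [2.0, 1.5, -0.5, 2.0],
           [0.5, 0.5, 0.5, 0.5]]
biases = [0.5, 0.5, 0.5]
print(raw_outputs(inputs, weights, biases))  # three raw scores
```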
By far the most common normalizing function is called Softmax:
exp(1.55) = 4.71
exp(2.30) = 9.97
exp(0.90) = 2.46
sum = 17.15

softmax(1.55) = 4.71 / 17.15 = 0.27
softmax(2.30) = 9.97 / 17.15 = 0.58
softmax(0.90) = 2.46 / 17.15 = 0.14
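A minimal Python sketch of the computation (the function name is mine):

```python
import math

def softmax(zs):
    # exponentiate each raw output value, then divide by the sum
    # so the results are positive and sum to 1.0
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.55, 2.30, 0.90])
print([round(p, 2) for p in probs])  # → [0.27, 0.58, 0.14]
```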
If you are using the back-propagation algorithm for training, then you need the Calculus derivative of the Softmax function. A common shortcut writes the derivative in terms of the output y rather than the input: softmax'(y) = y * (1 - y). Strictly speaking, that is only the diagonal of the full Jacobian: the derivative of output i with respect to input j is y_i * (1 - y_i) when i = j, and -y_i * y_j when i != j.
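A sketch of the full Jacobian computed from the output probabilities (function name is mine):

```python
def softmax_jacobian(probs):
    # J[i][j] = probs[i] * (1 - probs[i])  when i == j
    # J[i][j] = -probs[i] * probs[j]       when i != j
    n = len(probs)
    return [[probs[i] * ((1.0 if i == j else 0.0) - probs[j])
             for j in range(n)]
            for i in range(n)]
```

A handy sanity check: each row of the Jacobian sums to zero, because the output probabilities are constrained to sum to 1.0.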
I’d always wondered if there were alternatives to the Softmax function. I tracked down a rather obscure research paper published in 2016 that explored something called the Taylor Softmax function. The Taylor Softmax for the example values above is:
taylor(1.55) = 1.0 + 1.55 + 0.5 * (1.55)^2 = 3.75
taylor(2.30) = 1.0 + 2.30 + 0.5 * (2.30)^2 = 5.95
taylor(0.90) = 1.0 + 0.90 + 0.5 * (0.90)^2 = 2.31
sum = 12.00

taylor-soft(1.55) = 3.75 / 12.00 = 0.31
taylor-soft(2.30) = 5.95 / 12.00 = 0.50
taylor-soft(0.90) = 2.31 / 12.00 = 0.19
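A sketch of the Taylor Softmax (function name is mine), which replaces exp(z) with its second-order Taylor approximation 1 + z + z^2/2:

```python
def taylor_softmax(zs):
    # replace exp(z) with its second-order Taylor approximation;
    # 1 + z + z^2/2 = ((z + 1)^2 + 1) / 2 >= 0.5, so every term
    # stays positive and the results can be interpreted as probabilities
    ts = [1.0 + z + 0.5 * z * z for z in zs]
    total = sum(ts)
    return [t / total for t in ts]

probs = taylor_softmax([1.55, 2.30, 0.90])
print([round(p, 2) for p in probs])  # → [0.31, 0.5, 0.19]
```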
The Calculus derivative of the Taylor Softmax is rather ugly: with t(z) = 1 + z + z^2/2 and S the sum of the t values, the derivative of output t(z_i) / S with respect to input z_j works out, by the quotient rule, to ((1 + z_i) * S - t(z_i) * (1 + z_j)) / S^2 when i = j, and (-t(z_i) * (1 + z_j)) / S^2 when i != j.
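As a sketch (my own quotient-rule derivation and naming, not necessarily the paper's notation), the full Jacobian can be coded like this, using t(z) = 1 + z + z^2/2 and t'(z) = 1 + z:

```python
def taylor_softmax_jacobian(zs):
    # d(t(z_i)/S)/dz_j = (delta_ij * t'(z_i) * S - t(z_i) * t'(z_j)) / S^2
    # where t(z) = 1 + z + z^2/2, t'(z) = 1 + z, S = sum of all t(z_k)
    t  = [1.0 + z + 0.5 * z * z for z in zs]   # t(z)
    tp = [1.0 + z for z in zs]                 # t'(z)
    s = sum(t)
    n = len(zs)
    return [[((tp[i] * s if i == j else 0.0) - t[i] * tp[j]) / (s * s)
             for j in range(n)]
            for i in range(n)]
```

As with regular Softmax, the columns of this Jacobian sum to zero, because the outputs are constrained to sum to 1.0.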
I coded up a demo program to compare regular Softmax with the Taylor Softmax. My non-definitive mini-exploration showed the regular Softmax worked much better.
My conclusion: Almost everything related to neural networks is a bit tricky. The Taylor Softmax activation function may be worth additional investigation, but my micro-research example leaves me a bit skeptical about the usefulness of Taylor Softmax.