There has been a lot of recent research work done on deep neural networks. One result is that it’s now thought that using standard logistic sigmoid activation or tanh activation doesn’t work as well as rectified linear activation.
If you’re not familiar with neural networks this probably sounds like gibberish. I’ll try to explain. The key item in a neural network is called a hidden processing node. The value of a hidden node is computed as the sum of the products of the inputs to the node and their corresponding weights, plus a bias constant; then you take the tanh() of that sum.
The tanh() is called an activation function. The tanh() function can accept any value from negative infinity to positive infinity, and returns a value between -1.0 and +1.0. An alternative to tanh() is called logistic sigmoid, abbreviated sigmoid(), which is similar but returns a value between 0.0 and +1.0.
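The computation above can be sketched in a few lines of Python. This is just an illustration, not code from any particular library; the function names are mine:

```python
import math

def hidden_node_value(inputs, weights, bias):
    # sum of products of inputs and corresponding weights, plus the bias
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    # tanh() squashes the sum to a value between -1.0 and +1.0
    return math.tanh(s)

def sigmoid(x):
    # logistic sigmoid: squashes any value to between 0.0 and +1.0
    return 1.0 / (1.0 + math.exp(-x))
```

For example, with inputs (1.0, 2.0), weights (0.5, -0.25), and bias 0.1, the sum is (1.0)(0.5) + (2.0)(-0.25) + 0.1 = 0.1, and the node value is tanh(0.1), roughly 0.0997.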
When a neural network is trained, you need the calculus derivative of the activation function. A handy property of tanh() and sigmoid() is that their derivatives can be written in terms of the output value y rather than the input x. For tanh() the derivative is (1 – y)(1 + y), where y = tanh(x). For sigmoid() the derivative is (y)(1 – y), where y = sigmoid(x).
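A quick way to convince yourself of those derivative formulas is to compare them against a finite-difference approximation. A minimal sketch (the function names here are my own):

```python
import math

def tanh_deriv_from_output(y):
    # derivative of tanh, written using the output y = tanh(x)
    return (1.0 - y) * (1.0 + y)  # same as 1 - tanh(x)^2

def sigmoid_deriv_from_output(y):
    # derivative of logistic sigmoid, written using the output y = sigmoid(x)
    return y * (1.0 - y)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# numeric check at x = 0.7 using a central finite difference
x, h = 0.7, 1e-6
tanh_approx = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
sig_approx = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
```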
Rectified linear activation is so simple it’s confusing. In words: you return 0 if x is negative, or you return x if x is positive. So for the example in the image above, where the sum of products plus the bias is 1.05, the final value after rectified linear activation is just 1.05.
The calculus derivative is almost too simple. If x is negative the derivative is 0. If x is positive, the derivative is 1. (At exactly x = 0 the derivative isn’t defined, so in practice implementations just pick 0 or 1.)
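Both the rectified linear function and its derivative fit in a line each. A minimal sketch, with the zero-input convention called out explicitly:

```python
def relu(x):
    # rectified linear: 0 for negative x, x itself otherwise
    return x if x > 0.0 else 0.0

def relu_deriv(x):
    # derivative: 0 for negative x, 1 for positive x
    # (undefined at exactly x = 0; here 0 is picked by convention)
    return 1.0 if x > 0.0 else 0.0
```

With the example value from above, relu(1.05) just returns 1.05 unchanged.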
It’s not 100% clear why rectified linear activation seems to work better than tanh() or sigmoid() for deep neural networks. For sure, rectified linear doesn’t suffer from what’s called the vanishing gradient problem, because its derivative doesn’t shrink toward zero for large inputs the way the tanh() and sigmoid() derivatives do. Rectified linear also produces sparse activations (many hidden node values are exactly 0), an effect that has been loosely compared to what’s called dropout.
Very complex but interesting topic.