I gave a talk about the back-propagation algorithm recently. Back-propagation is used to train a neural network (NN). Consider a math equation like y = 5X1 + 7X2. The equation has two inputs, X1 and X2, and two constants, 5 and 7, that determine the output. If you think of a NN as a very complex math equation, the weights of the NN are the constants. Training a NN is the process of using data with known correct input and output values to find the values of the weights. And back-prop is the most common algorithm used for training.

A NN uses one or more internal activation functions. One common activation function is the logistic sigmoid, logsig(x) = 1.0 / (1.0 + e^-x). Back-propagation requires the Calculus derivative of the activation function. If y = logsig(x), then the Calculus derivative is y’ = e^-x / (1.0 + e^-x)^2 and by a very cool, non-obvious algebra coincidence y’ = y * (1 – y).
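The "coincidence" follows from noticing that 1 – y = e^-x / (1.0 + e^-x), so y * (1 – y) = e^-x / (1.0 + e^-x)^2. Here is a minimal Python sketch that checks the two forms numerically. The function names are my own, chosen just for illustration.

```python
import math

def logsig(x):
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def logsig_deriv_direct(x):
    """Derivative written out directly: e^-x / (1 + e^-x)^2."""
    return math.exp(-x) / (1.0 + math.exp(-x)) ** 2

def logsig_deriv_from_output(y):
    """Same derivative, using only the output y = logsig(x)."""
    return y * (1.0 - y)

# Print both forms side by side for a few sample inputs.
for x in (-2.0, 0.0, 0.5, 3.0):
    y = logsig(x)
    print(x, logsig_deriv_direct(x), logsig_deriv_from_output(y))
```

The second form is convenient in back-propagation because the output y has already been computed during the forward pass, so no extra call to exp() is needed.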

But for deep neural networks, a common activation function is ReLU(x) = max(0, x). If you graph y = ReLU(x) you can see that the function is differentiable everywhere except at one point. If x is greater than 0 the derivative is 1 and if x is less than 0 the derivative is 0. But when x = 0, the graph has a sharp corner (the slope is 0 on the left and 1 on the right), so the derivative does not exist.

There are two ways to deal with this. First, you can just arbitrarily assign a value for the derivative of y = ReLU(x) when x = 0. Common arbitrary values are 0, 0.5, and 1. Easy!
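A minimal Python sketch of the first approach might look like the following. The parameter name value_at_zero is hypothetical, and the default of 0 is just one of the common choices mentioned above.

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def relu_deriv(x, value_at_zero=0.0):
    """Derivative of ReLU, with an arbitrary assigned value at x = 0."""
    if x > 0.0:
        return 1.0
    if x < 0.0:
        return 0.0
    # The true derivative does not exist here, so return the chosen value.
    return value_at_zero

print(relu_deriv(2.5))                      # x > 0
print(relu_deriv(-1.0))                     # x < 0
print(relu_deriv(0.0, value_at_zero=0.5))   # arbitrary value at x = 0
```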

A second alternative is, instead of using the actual y = ReLU(x) function, to use an approximation to ReLU that is differentiable for all values of x. One such approximation is called softplus, which is defined as y = ln(1.0 + e^x). Its derivative is y’ = 1.0 / (1.0 + e^-x), which is, remarkably, the logistic sigmoid function. Neat!
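To see the softplus and logistic sigmoid relationship numerically, here is a small Python sketch that compares a central finite-difference estimate of the softplus derivative against logsig(x). The helper numeric_deriv and the step size h are my own assumptions, and this naive softplus would overflow for very large x.

```python
import math

def softplus(x):
    """Softplus: ln(1 + e^x), a smooth approximation to ReLU (naive form)."""
    return math.log(1.0 + math.exp(x))

def logsig(x):
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def numeric_deriv(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# The estimated slope of softplus should closely match logsig at each x.
for x in (-3.0, 0.0, 2.0):
    print(x, numeric_deriv(softplus, x), logsig(x))
```

Unlike ReLU, softplus has a well-defined derivative at x = 0, where the slope is logsig(0) = 0.5.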

When I implement a deep NN from scratch, I usually use the arbitrary-value-when-x-equals-zero approach. I have never seen any research that looks at which of the two ways to deal with y = ReLU(x) being non-differentiable at 0 is better.


Calculus derivative could be written as e^x / (e^x+1) ^2 and logistic sigmoid as e^x / (e^x+1)

(Quite right)

It's funny how I seem to ping-pong back to your articles. I like your clean explanations, but with others I'm left with questions. I just learned that “leaky ReLU” is used more often for deep networks than Gaussian ReLU (as explained by Siraj Raval in: https://www.youtube.com/watch?v=-7scQpJT7uo). Leaky ReLU is said to be simpler to use, and it's also said that it works with deep neural networks, but I kind of fail to verify this based on your C# iris flower example (which I often use for tests, as I’m more of an engineer wanting to use neural nets on hardware).

Yes, leaky ReLU is a common variation of ReLU. I like leaky ReLU a lot, and use it often.