I gave a talk about the back-propagation algorithm recently. Back-propagation is used to train a neural network (NN). Consider a math equation like y = 5X1 + 7X2. The equation has two inputs, X1 and X2, and two constants, 5 and 7, that determine the output. If you think of a NN as a very complex math equation, the weights of the NN are the constants. Training a NN is the process of using data with known correct input and output values to find the values of the weights. Back-prop is the most common algorithm used for training.

A NN uses one or more internal activation functions. One common activation function is the logistic sigmoid, logsig(x) = 1.0 / (1.0 + e^-x). Back-propagation requires the calculus derivative of the activation function. If y = logsig(x), then the derivative is y' = e^-x / (1.0 + e^-x)^2 and, by a very cool, non-obvious bit of algebra, y' = y * (1 - y).
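The two forms of the sigmoid derivative above can be checked numerically. This is a minimal sketch in plain Python; the function names (logsig_deriv_direct, logsig_deriv_from_output) are made up for illustration and are not from any library.

```python
import math

def logsig(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def logsig_deriv_direct(x: float) -> float:
    """Derivative written out directly: e^-x / (1 + e^-x)^2."""
    e = math.exp(-x)
    return e / (1.0 + e) ** 2

def logsig_deriv_from_output(y: float) -> float:
    """Derivative expressed through the output y = logsig(x): y * (1 - y)."""
    return y * (1.0 - y)

# The two forms should agree (up to floating-point rounding) at any x.
for x in (-2.0, 0.0, 0.5, 3.0):
    y = logsig(x)
    assert math.isclose(logsig_deriv_direct(x), logsig_deriv_from_output(y))
```

The y * (1 - y) form is handy in back-propagation because the output y has typically already been computed during the forward pass, so no extra call to exp is needed.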

But for deep neural networks, a common activation function is ReLU(x) = max(0, x). If you graph y = ReLU(x) you can see that the function is differentiable everywhere except at x = 0. If x is greater than 0 the derivative is 1, and if x is less than 0 the derivative is 0. But when x = 0, the derivative does not exist, because the slope is 0 on the left side and 1 on the right side.

There are two ways to deal with this. First, you can just arbitrarily assign a value for the derivative of y = ReLU(x) when x = 0. Common arbitrary values are 0, 0.5, and 1. Easy!
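The assign-a-value approach can be sketched as follows. This is an illustrative sketch only; the at_zero parameter name and its default of 0 are assumptions made for the example, not a standard API.

```python
def relu(x: float) -> float:
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def relu_deriv(x: float, at_zero: float = 0.0) -> float:
    """Derivative of ReLU, with an arbitrarily chosen value at x = 0.

    at_zero is a hypothetical parameter for this sketch; 0, 0.5, and 1
    are the common choices mentioned in the text.
    """
    if x > 0.0:
        return 1.0
    if x < 0.0:
        return 0.0
    return at_zero  # x == 0: the true derivative does not exist

# Example: same function, three different conventions at x = 0.
values_at_zero = [relu_deriv(0.0, at_zero=v) for v in (0.0, 0.5, 1.0)]
```

With floating-point weights and inputs, a pre-activation sum of exactly 0 is uncommon, so the choice usually affects only a small share of updates.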

A second alternative is, instead of using the actual y = ReLU(x) function, to use an approximation to ReLU that is differentiable for all values of x. One such approximation is called softplus, defined as y = ln(1.0 + e^x). Its derivative is y' = 1.0 / (1.0 + e^-x) which is, remarkably, the logistic sigmoid function. Neat!
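The claim that the derivative of softplus is the logistic sigmoid can be checked with a finite-difference approximation. This is a minimal sketch; numeric_deriv is a helper written just for this example, and the direct formula used for softplus would overflow for very large x, which a production implementation would need to handle.

```python
import math

def softplus(x: float) -> float:
    """Softplus: ln(1 + e^x), a smooth approximation to ReLU."""
    return math.log1p(math.exp(x))  # log1p(t) computes ln(1 + t)

def softplus_deriv(x: float) -> float:
    """Analytic derivative of softplus: 1 / (1 + e^-x), the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def numeric_deriv(f, x: float, h: float = 1e-6) -> float:
    """Central-difference estimate of f'(x); illustrative helper only."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# The analytic and numeric derivatives should agree closely.
for x in (-3.0, 0.0, 2.0):
    assert math.isclose(softplus_deriv(x), numeric_deriv(softplus, x), rel_tol=1e-6)
```

Unlike ReLU, softplus has a well-defined derivative at x = 0, where it equals 0.5.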

When I implement a deep NN from scratch, I usually use the arbitrary-value-when-x-equals-zero approach. I have never seen any research that looks at which of the two ways to deal with y = ReLU(x) being non-differentiable at 0 is better.