Deriving the Gradient for Neural Network Back-Propagation with Cross-Entropy Error

You can think of a neural network as a very complicated math function that has constants called weights (and special weights called biases). Training a neural network is the process of finding the values of the weights.

The most common form of NN training is the back-propagation algorithm. Back-propagation requires the calculus gradient, and the gradient depends on the error function used to compare computed output values with the correct target values from the training data. The two most common error functions are squared error and cross-entropy error.
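To make the two error functions concrete, here is a minimal Python sketch. The function names and the sample output/target values are hypothetical; the cross-entropy shown is the per-node form commonly paired with sigmoid output nodes.

```python
import math

def squared_error(targets, outputs):
    # E = sum over k of (t_k - o_k)^2
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

def cross_entropy_error(targets, outputs):
    # per-node cross-entropy, suited to sigmoid outputs:
    # E = -sum over k of [t_k * ln(o_k) + (1 - t_k) * ln(1 - o_k)]
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

targets = [1.0, 0.0]   # hypothetical target values
outputs = [0.7, 0.2]   # hypothetical computed output values
print(squared_error(targets, outputs))        # about 0.13
print(cross_entropy_error(targets, outputs))  # about 0.58
```

Both functions are zero only when every output equals its target; cross-entropy punishes confident wrong answers much more harshly than squared error does.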

A staple of every machine learning course is the derivation of the back-propagation gradient. I’ll show it here, and then explain why this entire blog post really doesn’t make much sense.

First, imagine a neural network that has two or more output nodes, uses the logistic sigmoid function for output activation, and is trained with cross-entropy error.

The gradient for one hidden-to-output weight, in symbols, is:

∂E / ∂w_jk

This means, "the partial derivative of the error with respect to the weight from hidden node j to output node k."

The equation for cross-entropy error (in the per-node form that pairs with sigmoid output nodes, where t_k is the target value and o_k is the computed output value for node k) is:

E = -Σ_k [ t_k * ln(o_k) + (1 - t_k) * ln(1 - o_k) ]
The output function is the logistic sigmoid applied to the node's sum:

o_k = 1 / (1 + e^(-s_k))
And the s function (each hidden node value times its associated weight, summed, plus the bias) is:

s_k = Σ_j (h_j * w_jk) + b_k
To find the gradient, you can use the Calculus chain rule:

∂E/∂w_jk = ∂E/∂o_k * ∂o_k/∂s_k * ∂s_k/∂w_jk
         = [ (o_k - t_k) / (o_k * (1 - o_k)) ] * [ o_k * (1 - o_k) ] * h_j
         = (o_k - t_k) * h_j

The o_k * (1 - o_k) terms cancel, which is why the cross-entropy gradient for a hidden-to-output weight reduces to the tidy (o_k - t_k) * h_j.
There you have it. . . not.

The problem with this derivation is that if you’ve seen it before and already understand it, then there’s no new information. But if you don’t already understand this derivation, it’s pretty much meaningless because pages and pages of details have been left out.

Also, if you’re a developer, then this math doesn’t help you if you want to write actual code!

I guess my point is that the back-propagation algorithm is very, very difficult to understand, mostly because it has dozens of interrelated components that you must understand first.

This entry was posted in Machine Learning.