Even though neural networks have been studied for decades, there are many issues that aren’t well understood by the engineering community. For example, if you search the Internet for information about implementing batch training, you’ll find a lot of questions but few answers.

In pseudo-code, there are two alternatives. One approach accumulates all deltas (increments) for the weights, and then updates:

loop maxEpochs times
  for-each training item
    compute gradients for each weight
    use gradients to compute deltas
    for-each weight
      accumulate the deltas
    end-for
  end-for
  use accumulated deltas to update weights
end-loop

The second approach accumulates all gradients for the weights, and then updates:

loop maxEpochs times
  for-each training item
    compute gradients for each weight
    accumulate the gradients
  end-for
  use accumulated gradients to compute the deltas,
  and then update weights
end-loop

So, which approach is correct? Do you accumulate the deltas or the gradients? The answer is that the two approaches are equivalent. The delta for a weight connecting two nodes is the learning rate times the gradient associated with that weight: delta[i,j] = learnRate * gradient[i,j]. Because each delta is just a constant multiple of its gradient, summing the deltas gives the same result as summing the gradients and multiplying the sum by the learning rate (assuming the learning rate is held constant). So, mathematically, it doesn't matter whether you accumulate the weight deltas or the weight gradients.
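A minimal numeric sketch of the equivalence, for a single weight. The gradient values here are made up for illustration; the point is only that the two accumulation orders produce the same total update when the learning rate is constant.

```python
learn_rate = 0.05
gradients = [0.4, -1.2, 0.7]  # hypothetical per-item gradients for one weight

# Approach 1: compute a delta for each item and accumulate the deltas.
acc_delta = 0.0
for g in gradients:
    acc_delta += learn_rate * g

# Approach 2: accumulate the gradients, then compute one delta at the end.
acc_grad = 0.0
for g in gradients:
    acc_grad += g
delta_from_grads = learn_rate * acc_grad

# The two totals agree (up to floating-point rounding).
print(abs(acc_delta - delta_from_grads) < 1.0e-12)
```

This is just the distributive law: learnRate * g1 + learnRate * g2 + ... equals learnRate * (g1 + g2 + ...).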

I wrote some code to demonstrate. My demo generates 1,000 synthetic items where each item has four input (feature) values and three output (class) values. I wrote two different methods that train a neural network, one that accumulates weight deltas and one that accumulates weight gradients. The results were identical.

This discussion is all about “full-batch” training, where all training items are processed before doing any weight updates. The issues are the same for mini-batch training, where you process a chunk of training items at a time. However, with “stochastic” (also called “online”) training, weights are updated after each training item is processed, so the question of accumulating deltas versus gradients doesn’t arise.
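For contrast, here is a hedged sketch of the online case for a single weight. The per-item loss (w - x)^2 and the data values are hypothetical stand-ins; the point is that each item's gradient is applied immediately, so there is no accumulation step.

```python
learn_rate = 0.05
w = 0.0
data = [2.0, -1.0, 3.0]  # hypothetical training targets

for x in data:
    grad = 2.0 * (w - x)     # gradient of the per-item loss (w - x)^2
    w -= learn_rate * grad   # update immediately; nothing is accumulated
```

Because the weight changes between items, each gradient is computed at a different weight value, which is why online training and full-batch training generally do not produce identical results.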
