Even though neural networks have been studied for decades, there are many issues that aren’t well understood by the engineering community. For example, if you search the Internet for information about implementing batch training, you’ll find a lot of questions but few answers.
In pseudo-code, there are two alternatives. One approach accumulates all deltas (increments) for the weights, and then updates:
loop maxEpochs times
  for-each training item
    compute gradients for each weight
    use gradients to compute deltas for each weight
    accumulate the deltas
  end-for
  use accumulated deltas to update weights
end-loop
The second approach accumulates all gradients for the weights, and then updates:
loop maxEpochs times
  for-each training item
    compute gradients for each weight
    accumulate the gradients
  end-for
  use accumulated gradients to compute the deltas, and then update weights
end-loop
So, which approach is correct? Do you accumulate the deltas or the gradients? The answer is that both approaches are equivalent. The delta value for a weight connecting two nodes is the learning rate times the gradient associated with that weight: delta[i,j] = learnRate * gradient[i,j]. Because the learning rate is a constant, summing the deltas is the same as summing the gradients and then multiplying the sum by the learning rate. So, mathematically, it doesn't matter whether you accumulate the weight deltas or the weight gradients.
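The equivalence is easy to check numerically. Here is a minimal NumPy sketch; the `grad` function, the synthetic items, and the names `learn_rate`, `w0` are illustrative assumptions, not the actual demo code:

```python
import numpy as np

rng = np.random.default_rng(0)
items = [rng.standard_normal(4) for _ in range(5)]  # toy "training items"
learn_rate = 0.01

def grad(w, x):
    # Placeholder gradient; in a real network this comes from backpropagation.
    return np.outer(x, x @ w)

w0 = rng.standard_normal((4, 3))  # one 4x3 weight matrix

# Approach 1: accumulate deltas, then update.
acc_deltas = np.zeros_like(w0)
for x in items:
    acc_deltas += learn_rate * grad(w0, x)
w_a = w0 - acc_deltas

# Approach 2: accumulate gradients, then update.
acc_grads = np.zeros_like(w0)
for x in items:
    acc_grads += grad(w0, x)
w_b = w0 - learn_rate * acc_grads

print(np.allclose(w_a, w_b))  # True: the two approaches give identical weights
```

The key point is that `learn_rate` factors out of the sum, so multiplying before or after accumulation makes no difference.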
I wrote some code to demonstrate. My demo generates 1,000 synthetic items, where each item has four input (feature) values and three output (class) values. I implemented two methods that train a neural network, one that accumulates weight deltas and one that accumulates weight gradients. The results were identical.
This discussion is all about "full-batch" training, where all training items are processed before doing any weight updates. The issues are the same for mini-batch training, where you process a chunk of training items at a time and update after each chunk. However, for "stochastic" (also called "online") training, the question doesn't arise at all, because weights are updated immediately after each training item is processed, so there is nothing to accumulate.
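For contrast, a stochastic/online training loop can be sketched like this, again with a placeholder gradient function standing in for backpropagation:

```python
import numpy as np

rng = np.random.default_rng(1)
items = [rng.standard_normal(4) for _ in range(5)]  # toy "training items"
learn_rate = 0.01

def grad(w, x):
    # Placeholder gradient; a real network would use backpropagation here.
    return np.outer(x, x @ w)

w = rng.standard_normal((4, 3))

# Stochastic ("online") training: update immediately after each item,
# so there are no accumulated deltas or gradients.
for x in items:
    w -= learn_rate * grad(w, x)
```

Note that this generally produces different final weights than full-batch training, because each gradient is computed using the most recently updated weights rather than the weights at the start of the pass.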