Neural Network Batch Training – Accumulate the Gradients or the Deltas?

Even though neural networks have been studied for decades, there are many issues that aren’t well understood by the engineering community. For example, if you search the Internet for information about implementing batch training, you’ll find a lot of questions but few answers.

In pseudo-code, there are two alternatives. One approach accumulates all the weight deltas (increments) and then updates the weights:

loop maxEpochs times
  for-each training item
    compute gradients for each weight
    use gradients to compute deltas for each weight
    accumulate the deltas
  end-for
  use accumulated deltas to update weights
end-loop
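
To make the first pattern concrete, here is a minimal sketch in Python. To keep it short, the sketch uses a single linear layer with squared error instead of a full neural network, so the per-item gradient is easy to write down; the data, sizes, and names (train_x, learn_rate, and so on) are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
train_x = rng.normal(size=(100, 4))       # 100 items, 4 input values each
true_w = rng.normal(size=(4, 3))          # hidden "true" weights for the synthetic data
train_y = train_x @ true_w                # 3 output values per item

weights = np.zeros((4, 3))
learn_rate = 0.001
max_epochs = 50

for epoch in range(max_epochs):
    acc_deltas = np.zeros_like(weights)            # reset the accumulator each epoch
    for x, y in zip(train_x, train_y):
        grad = np.outer(x, x @ weights - y)        # gradient of squared error for one item
        acc_deltas += -learn_rate * grad           # compute and accumulate the delta
    weights += acc_deltas                          # one weight update per epoch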

The second approach accumulates all the weight gradients and then updates the weights:

loop maxEpochs times
  for-each training item
    compute gradients for each weight
    accumulate the gradients
  end-for
  use accumulated gradients to compute deltas, then update weights
end-loop
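
Here is the matching sketch of the second approach, using the same illustrative setup as above. The only difference is where the learning rate is applied.

import numpy as np

rng = np.random.default_rng(0)
train_x = rng.normal(size=(100, 4))
true_w = rng.normal(size=(4, 3))
train_y = train_x @ true_w

weights = np.zeros((4, 3))
learn_rate = 0.001
max_epochs = 50

for epoch in range(max_epochs):
    acc_grads = np.zeros_like(weights)             # reset the accumulator each epoch
    for x, y in zip(train_x, train_y):
        acc_grads += np.outer(x, x @ weights - y)  # accumulate the gradient for each item
    weights += -learn_rate * acc_grads             # apply the learning rate once, then update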

So, which approach is correct? Do you accumulate the deltas or the gradients? The answer is that the two approaches are equivalent. The delta value for a weight connecting two nodes is the learning rate times the gradient associated with the weight: delta[i,j] = learnRate * gradient[i,j]. Because the learning rate is just a constant factor, summing the per-item deltas gives the same value as scaling the summed per-item gradients: sum( learnRate * gradient[i,j] ) = learnRate * sum( gradient[i,j] ). So, mathematically, it doesn't matter whether you accumulate the weight deltas or the weight gradients.
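
A tiny numeric check makes that distributive step concrete; the gradient values here are arbitrary.

import numpy as np

learn_rate = 0.05
grads = np.array([0.3, -1.2, 0.7, 2.0])   # per-item gradients for a single weight
acc_deltas = np.sum(learn_rate * grads)   # accumulate the deltas
acc_grads = learn_rate * np.sum(grads)    # accumulate the gradients, scale once
print(np.isclose(acc_deltas, acc_grads))  # True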

I wrote some code to demonstrate. My demo generates 1,000 synthetic items, where each item has four input (feature) values and three output (class) values. I then wrote two methods that train a neural network on that data, one that accumulates weight deltas and one that accumulates weight gradients. The results were identical.
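
The demo code itself isn't shown here, but a simplified stand-in for the comparison, again using the illustrative linear model rather than a real neural network, looks like this:

import numpy as np

def make_data(n_items=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_items, 4))              # four input (feature) values per item
    w_true = rng.normal(size=(4, 3))               # three output values per item
    return x, x @ w_true

def train_accum_deltas(x, y, learn_rate=0.0001, max_epochs=20):
    weights = np.zeros((4, 3))
    for _ in range(max_epochs):
        acc_deltas = np.zeros_like(weights)
        for xi, yi in zip(x, y):
            grad = np.outer(xi, xi @ weights - yi)
            acc_deltas += -learn_rate * grad       # accumulate the deltas
        weights += acc_deltas
    return weights

def train_accum_grads(x, y, learn_rate=0.0001, max_epochs=20):
    weights = np.zeros((4, 3))
    for _ in range(max_epochs):
        acc_grads = np.zeros_like(weights)
        for xi, yi in zip(x, y):
            acc_grads += np.outer(xi, xi @ weights - yi)  # accumulate the gradients
        weights += -learn_rate * acc_grads
    return weights

x, y = make_data()
w1 = train_accum_deltas(x, y)
w2 = train_accum_grads(x, y)
print(np.allclose(w1, w2))                         # True -- the two approaches agree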

This discussion applies to "full-batch" training, where all training items are processed before any weight update is performed. The issue is the same for mini-batch training, where you process a chunk of training items at a time before each update. However, for "stochastic" (also called "online") training, the weights are updated after each individual training item is processed, so the question of accumulating deltas versus gradients doesn't arise.
