When training a neural network using the back-propagation algorithm where you update after looking at a single training item (stochastic or online training), you must supply a value for the learning rate. The learning rate controls how much each weight and bias is changed in each iteration of training. For example, Python code to update the weight that connects input node [i] to hidden node [j] could look like:
delta = learnRate * ihGrads[i,j]
self.ihWeights[i,j] += delta
Here ihGrads[i,j] is the gradient associated with the weight, which has been calculated earlier in the code. For example, if a current input-to-hidden weight has value +3.56, the learning rate is 0.01, and the associated gradient is +2.70, then the new value of the weight is: w = 3.56 + (0.01 * 2.70) = 3.56 + 0.027 = 3.587.
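To make the arithmetic concrete, here is a minimal sketch of the same update applied to a small weight matrix stored as plain Python lists. The function name update_ih_weights and the list-of-lists layout are assumptions for illustration only. The sketch keeps the sign convention of the snippet above, where the scaled gradient is added to the weight.

```python
def update_ih_weights(ih_weights, ih_grads, learn_rate):
    """Apply one plain (no momentum) update to every input-to-hidden weight.

    Hypothetical helper: ih_weights and ih_grads are lists of lists
    with the same shape, indexed as [i][j].
    """
    for i in range(len(ih_weights)):
        for j in range(len(ih_weights[i])):
            # Same rule as above: the scaled gradient is added to the weight.
            delta = learn_rate * ih_grads[i][j]
            ih_weights[i][j] += delta
    return ih_weights


# The single weight from the worked example, placed in a 1x1 matrix.
weights = update_ih_weights([[3.56]], [[2.70]], learn_rate=0.01)
print(round(weights[0][0], 3))
```

The printed value should match the 3.587 computed by hand.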
If you pick a small learning rate, training will proceed slowly but surely. However, it could be too slow, taking hours or even days. But if you pick a large learning rate, you could overshoot a good answer, then undershoot on the next iteration, and get into an oscillating pattern where training never converges.
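Both failure modes are easy to see on a toy problem. The sketch below is not a neural network. It assumes a single weight w and a made-up error function E(w) = w * w, whose best value is w = 0. The gradient is taken as the negative slope so that it can be added to the weight, matching the convention used in the code above.

```python
def train(learn_rate, w=1.0, steps=5):
    """Run a few plain updates on the toy error E(w) = w * w."""
    history = [w]
    for _ in range(steps):
        grad = -2.0 * w          # negative slope of the toy error at w
        w += learn_rate * grad   # add the scaled gradient, as in the article
        history.append(w)
    return history


slow = train(0.01)      # creeps toward 0: 1.0, 0.98, 0.9604, ...
bouncing = train(1.0)   # overshoots every time: 1.0, -1.0, 1.0, -1.0, ...
print(slow)
print(bouncing)
```

With the small rate, the weight has barely moved after five steps. With the large rate, it jumps from one side of the minimum to the other and never settles.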
The purpose of momentum is to speed up training but with a reduced risk of oscillating. In code, momentum could be implemented like:
delta = learnRate * ihGrads[i,j]
self.ihWeights[i,j] += delta
self.ihWeights[i,j] += momentum * ih_prev_weights_delta[i,j]
ih_prev_weights_delta[i,j] = delta
Technically, momentum is very simple. After updating a weight, you update it a second time using a fraction of the previous iteration's delta. Suppose the momentum rate is set to 0.50 and the delta from the previous iteration was 0.044. Then the weight from the example above would become w = 3.56 + (0.01 * 2.70) + (0.50 * 0.044) = 3.56 + 0.027 + 0.022 = 3.609.
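The worked example can be checked with a small stand-alone function. This is only a sketch for a single weight. The name momentum_update and its parameters are assumptions for illustration, and the logic simply mirrors the four-line snippet above.

```python
def momentum_update(weight, grad, prev_delta, learn_rate, momentum):
    """One momentum update for a single weight (hypothetical helper).

    Returns the new weight and the delta to remember for the next call.
    """
    delta = learn_rate * grad
    weight += delta                   # ordinary update
    weight += momentum * prev_delta   # second update from the previous delta
    # As in the snippet above, only the gradient part is stored.
    return weight, delta


w, prev = momentum_update(3.56, 2.70, prev_delta=0.044,
                          learn_rate=0.01, momentum=0.50)
print(round(w, 3), round(prev, 3))
```

Note that this version, like the snippet it mirrors, remembers only the gradient part of the step. Some formulations of momentum store the full step, including the momentum term, so the two variants can behave a little differently.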
The reason why momentum helps is actually quite subtle and most explanations I’ve seen on the Internet are a bit misleading.
A typical explanation is that momentum helps training escape being trapped in local minima, by jumping over such minima. But that explanation doesn’t fully explain why momentum wouldn’t also jump over the global error minimum you’re looking for.
In essence, during the early part of training, momentum just moves you faster. Later, when you are closer to convergence, the update delta values become very small, so the momentum term becomes small too, and you are less likely to jump over the global minimum. Put another way, momentum is a technique that creates an adaptive learning rate, and one that varies for different weights.
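The shrinking momentum term can be observed on a toy problem. The sketch below assumes a single weight and a made-up error function E(w) = w * w, with the gradient taken as the negative slope so it can be added to the weight. The learning rate, momentum value, and step count are arbitrary choices, and a real network will not behave this neatly.

```python
def momentum_terms(learn_rate=0.1, momentum=0.5, w=1.0, steps=20):
    """Track the size of the momentum term while minimizing E(w) = w * w."""
    prev_delta = 0.0
    terms = []
    for _ in range(steps):
        grad = -2.0 * w                  # negative slope of the toy error
        delta = learn_rate * grad
        boost = momentum * prev_delta    # the extra push from momentum
        w += delta
        w += boost
        prev_delta = delta
        terms.append(abs(boost))
    return terms, w


terms, final_w = momentum_terms()
print(terms[1], terms[-1], final_w)
```

In this run, the momentum term is largest near the start, when the gradients are big, and fades toward zero as the weight approaches the minimum.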
The downside to using momentum is that the value of the momentum factor is a free parameter and you have to use trial and error to find a good value.