There are several topics related to neural network implementation that are the source of much confusion and incorrect information. Nesterov momentum (also called Nesterov Accelerated Gradient) is one such topic.
I was preparing to give a talk about neural network momentum, so I did a quick review of the Internet to see what common developer sites such as Stack Overflow had to say about Nesterov momentum. I was not terribly surprised to find a lot of misleading, and in many cases, just completely inaccurate information. I wasn’t surprised because Nesterov momentum is simple in principle, but extremely tricky in the details.
A full explanation of Nesterov momentum would take many pages, so I’ll try to be brief at the expense of 100% correctness. When training a NN, on each iteration, you compute a delta for each weight. The standard delta is minus one times a small constant (the learning rate, typically something like 0.01) times the gradient. With regular momentum you add an additional term equal to a constant (the momentum constant, typically something like 0.8) times the previous delta.
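For a single weight, a regular-momentum update can be sketched like this. The toy quadratic loss, the function name, and the hyperparameter names (lr, mu) are my own illustrative choices, not from any particular library:

```python
# Sketch of a single-weight update with regular momentum.
# Assumed toy loss: f(w) = w^2, so the gradient is 2*w.

def momentum_step(w, prev_delta, grad, lr=0.01, mu=0.8):
    # delta = -(learning rate * gradient) + (momentum constant * previous delta)
    delta = -lr * grad + mu * prev_delta
    return w + delta, delta

w, prev_delta = 5.0, 0.0
for _ in range(200):
    grad = 2.0 * w                 # gradient of f(w) = w^2 at the current w
    w, prev_delta = momentum_step(w, prev_delta, grad)
print(abs(w) < 1e-4)               # True: w has converged near the minimum at 0
```

The previous delta term is what gives the update its "momentum": consecutive steps in the same direction reinforce each other.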
With Nesterov momentum, in theory you calculate the gradient not for the current weights, but rather for the current weights plus the momentum constant times the previous delta. This is a deep idea. Unfortunately, it’s also quite annoying to actually compute in practice.
So, there’s an alternative form of Nesterov momentum where the delta looks quite a bit different but is (almost) exactly the same mathematically. The alternative form uses just the gradient calculated for the current weights, which is much easier to compute.
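One widely used rewriting tracks a velocity term and needs only the current gradient. This is a sketch of that common reformulation (the names are mine, and the algebra, delta = mu² * v - (1 + mu) * lr * grad, is the usual textbook version, not necessarily the only equivalent one):

```python
# Sketch of the alternative form of Nesterov momentum. Only the gradient
# at the current weights is needed; the velocity v carries the history.

def nesterov_alt_step(w, v, grad, lr=0.01, mu=0.8):
    v_new = mu * v - lr * grad        # update the velocity first
    delta = mu * v_new - lr * grad    # equals mu^2 * v - (1 + mu) * lr * grad
    return w + delta, v_new

# Toy usage with f(w) = w^2 (gradient 2*w), same setup as before.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = nesterov_alt_step(w, v, 2.0 * w)
print(abs(w) < 1e-4)                  # True: converges to the minimum at 0
```

The two forms are related by a change of variables (the weights in one form correspond to the lookahead point in the other), which is why they are almost, but not quite, identical.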
Anyway, there are a couple of morals to the story. First, with neural networks, everything is tricky. Second, there’s a somewhat surprising amount of incorrect information on the Internet about implementing neural networks — you really need to go to the original research papers.