There are several topics related to neural network implementation that are the source of much confusion and incorrect information. Nesterov momentum (also called Nesterov Accelerated Gradient) is one such topic.

I was preparing to give a talk about neural network momentum, so I did a quick review of the Internet to see what common developer sites such as Stack Overflow had to say about Nesterov momentum. I was not terribly surprised to find a lot of misleading, and in many cases, just completely inaccurate information. I wasn’t surprised because Nesterov momentum is simple in principle, but extremely tricky in the details.

A full explanation of Nesterov momentum would take many pages, so I'll try to be brief at the expense of 100% correctness. When training a NN, on each iteration you compute a delta for each weight. The standard delta is minus one times a small constant (the learning rate, typically something like 0.01) times the gradient. With regular momentum you add an additional term equal to a constant (the momentum constant, typically something like 0.8) times the previous delta.
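The delta computation with regular momentum can be sketched on a toy one-dimensional problem. This is just an illustration, not production code: I'm assuming a made-up loss f(w) = w^2 (so the gradient is 2w), along with the typical learning rate and momentum values mentioned above.

```python
# Toy sketch of gradient descent with regular momentum.
# Assumed loss: f(w) = w^2, so the gradient is 2*w.

def grad(w):
    return 2.0 * w  # gradient of the assumed loss f(w) = w^2

lr = 0.01        # learning rate
mu = 0.8         # momentum constant
w = 5.0          # starting weight
prev_delta = 0.0

for _ in range(200):
    # standard delta (-lr * gradient) plus the momentum term
    delta = -lr * grad(w) + mu * prev_delta
    w += delta
    prev_delta = delta

print(w)  # w moves toward the minimum at 0
```

The momentum term keeps the weight moving in its recent direction, which typically speeds convergence on this kind of smooth loss surface.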

With Nesterov momentum, in theory you calculate the gradient not for the current weights, but rather for the current weights plus the momentum constant times the previous delta. This is a deep idea. Unfortunately, it’s also quite annoying to actually compute in practice.
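The theoretical "lookahead" form described above can be sketched on the same toy setup (again, the f(w) = w^2 loss and the constants are my assumptions, just for illustration):

```python
# Toy sketch of the theoretical (lookahead) form of Nesterov momentum.
# The gradient is evaluated NOT at the current weight, but at the
# current weight plus the momentum constant times the previous delta.
# Assumed loss: f(w) = w^2, gradient 2*w.

def grad(w):
    return 2.0 * w

lr = 0.01
mu = 0.8
w = 5.0
prev_delta = 0.0

for _ in range(200):
    lookahead = w + mu * prev_delta          # peek ahead along the momentum direction
    delta = -lr * grad(lookahead) + mu * prev_delta
    w += delta
    prev_delta = delta

print(w)  # w moves toward the minimum at 0
```

The annoyance in a real NN is that "evaluating the gradient at the lookahead point" means running a separate forward-backward pass at shifted weights, which is why implementations prefer the equivalent form below.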

So, there’s an alternative form of Nesterov momentum where the delta looks quite a bit different but is (almost) exactly the same mathematically. The alternative form uses just the gradient calculated for the current weights, which is much easier to compute.
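One common algebraic rearrangement of Nesterov momentum (the one I believe most libraries use) needs only the gradient at the current weights. A sketch on the same assumed toy problem:

```python
# Toy sketch of an alternative, mathematically equivalent form of
# Nesterov momentum that uses only the gradient at the CURRENT weights.
# Assumed loss: f(w) = w^2, gradient 2*w.

def grad(w):
    return 2.0 * w

lr = 0.01
mu = 0.8
w = 5.0
v = 0.0   # velocity (accumulated momentum)

for _ in range(200):
    g = grad(w)               # gradient at the current weights only
    v = mu * v - lr * g       # update the velocity as usual
    delta = mu * v - lr * g   # Nesterov-style delta built from current-weight gradient
    w += delta

print(w)  # w moves toward the minimum at 0
```

No lookahead evaluation is needed, so the cost per iteration is the same as regular momentum.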

Anyway, there are a couple of morals to the story. First, with neural networks, everything is tricky. Second, there’s a somewhat surprising amount of incorrect information on the Internet about implementing neural networks — you really need to go to the original research papers.


Is this the math that tries to dampen the weights during training? If the weights were put on a graph, one would see some kind of tilted sine waves (heading toward a certain point, with decreasing amplitude as they near a solution). Last week I saw something like that, where people were able to reduce training time by a big factor. (The idea was that after a few sine oscillations, one could use half of the sine amplitude, … and I think one can repeat that to dampen toward a solution faster. But I don't know what it was called.)

I mean like this: https://www.youtube.com/watch?v=7HZk7kGk5bU