During neural network training, it’s possible to use a momentum factor. Momentum is a technique designed to speed up training. But I hardly ever see momentum used. The main problem with momentum is that it adds another hyperparameter, the momentum factor, and the time spent determining a good value for the momentum factor outweighs the benefit in speed.

There are two types of momentum: regular momentum and Nesterov momentum. Nesterov momentum is a more technically sophisticated version of regular momentum. See https://jamesmccaffrey.wordpress.com/2017/07/24/neural-network-nesterov-momentum/.
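
Here's a minimal sketch of how the two update rules differ, written in plain Python for a single scalar weight. It follows the formulation documented for `torch.optim.SGD` with dampening = 0; other references fold the learning rate into the velocity term, so treat this as one common convention rather than the only one. The helper names `sgd_step` and `minimize` and the toy loss f(w) = w^2 / 2 are assumptions made for illustration.

```python
def sgd_step(w, grad, velocity, lr, momentum=0.0, nesterov=False):
    """One SGD update on a single scalar weight (dampening assumed to be 0)."""
    # Velocity accumulates a decaying sum of past gradients.
    velocity = momentum * velocity + grad
    if nesterov:
        # Nesterov: current gradient plus a look-ahead along the velocity.
        step = grad + momentum * velocity
    else:
        # Regular momentum (reduces to plain SGD when momentum == 0).
        step = velocity
    return w - lr * step, velocity


def minimize(momentum=0.0, nesterov=False, lr=0.01, steps=100):
    """Minimize the toy loss f(w) = w**2 / 2, whose gradient is simply w."""
    w, velocity = 1.0, 0.0
    for _ in range(steps):
        grad = w
        w, velocity = sgd_step(w, grad, velocity, lr, momentum, nesterov)
    return w


w_plain = minimize()
w_momentum = minimize(momentum=0.95)
w_nesterov = minimize(momentum=0.95, nesterov=True)

print("no momentum:      ", w_plain)
print("regular momentum: ", w_momentum)
print("Nesterov momentum:", w_nesterov)
```

On this toy problem, both momentum runs end closer to the minimum at w = 0 than the no-momentum run after 100 steps. That only says something about speed on a simple convex loss; it says nothing about test accuracy on a real network.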

As usual, the idea is best explained by a concrete example. In the images below, I use no momentum, regular momentum (factor = 0.95), and Nesterov momentum (factor = 0.95). If you look at the loss values, you can see that the two momentum runs do in fact train faster. But if you look at the accuracy metrics, you can see that the no-momentum version has the best test accuracy. The point is that training speed isn’t the only thing that’s important.

*Left: No momentum. Center: Regular momentum, factor = 0.95. Right: Nesterov momentum, factor = 0.95.*

The key statements are:

    max_epochs = 1000
    ep_log_interval = 100
    lrn_rate = 0.01

    loss_func = T.nn.NLLLoss()  # assumes log_softmax()

    # 1. optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
    # 2. optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate,
    #      momentum=0.95)
    # 3. optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate,
    #      nesterov=True, momentum=0.95, dampening=0)
    . . .

I’m using stochastic gradient descent (SGD). The learning rate is required, and tuning it is a major challenge. The first version doesn’t use any momentum. The second version uses regular momentum with factor 0.95. The third version uses Nesterov momentum with factor 0.95 (Nesterov requires a dampening of 0, which is also the PyTorch default).
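
To make the three optimizer variants concrete, here's a small self-contained sketch that trains the same tiny network three times, once per variant. This is not the demo program behind the screenshots: the synthetic data, the 4-8-2 network, and the helper names `make_optimizer` and `train` are all assumptions made for illustration. Only the `T.optim.SGD` calls mirror the key statements.

```python
import torch as T


def make_optimizer(params, kind, lrn_rate=0.01, momentum=0.95):
    """Build one of the three SGD variants (hypothetical helper)."""
    if kind == "none":
        return T.optim.SGD(params, lr=lrn_rate)
    if kind == "regular":
        return T.optim.SGD(params, lr=lrn_rate, momentum=momentum)
    if kind == "nesterov":
        return T.optim.SGD(params, lr=lrn_rate, momentum=momentum,
                           nesterov=True, dampening=0)
    raise ValueError("unknown optimizer kind: " + kind)


def train(kind, max_epochs=200):
    """Full-batch training on made-up data; returns the final training loss."""
    T.manual_seed(1)  # same data and same initial weights for every variant
    x = T.randn(60, 4)
    y = (x[:, 0] + x[:, 1] > 0).long()  # synthetic two-class labels
    net = T.nn.Sequential(
        T.nn.Linear(4, 8), T.nn.Tanh(),
        T.nn.Linear(8, 2), T.nn.LogSoftmax(dim=1))
    loss_func = T.nn.NLLLoss()  # expects log-probabilities as input
    optimizer = make_optimizer(net.parameters(), kind)
    for _ in range(max_epochs):
        optimizer.zero_grad()
        loss = loss_func(net(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()


final_loss = {kind: train(kind) for kind in ("none", "regular", "nesterov")}
for kind, loss in final_loss.items():
    print(f"{kind:>8}: final training loss = {loss:.4f}")
```

A sketch like this only reports training loss, so it can show speed but not the test-accuracy effect described above; a held-out set would be needed for that. Note that PyTorch raises a `ValueError` if `nesterov=True` is combined with a nonzero dampening or a momentum of 0.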

So, without going into all the technical details, it’s hard enough to find a good learning rate. If you also have to find a good value for the momentum factor, you greatly complicate things.
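
A rough way to see the extra cost is to count the training runs in a simple grid search. The candidate values below are hypothetical, picked only to show how the number of combinations grows once momentum enters the picture.

```python
from itertools import product

# Hypothetical candidate values, for illustration only.
lrn_rates = [0.001, 0.005, 0.01, 0.05, 0.1]
momentum_factors = [0.5, 0.9, 0.95, 0.99]
variants = ["regular", "nesterov"]

# Without momentum, only the learning rate needs tuning.
no_momentum_runs = [{"lr": lr} for lr in lrn_rates]

# With momentum, every learning rate is paired with every factor and variant.
momentum_runs = [
    {"lr": lr, "momentum": m, "nesterov": v == "nesterov"}
    for lr, m, v in product(lrn_rates, momentum_factors, variants)
]

print(len(no_momentum_runs))  # 5
print(len(momentum_runs))     # 40
```

With these made-up grids, tuning momentum means eight times as many training runs as tuning the learning rate alone.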

Neural network training momentum is one of several topics that are great in theory, but just don’t work too well in practice.

*When I was a college student, I did well in math classes but poorly in physics classes. I never did quite figure out momentum, inertia, and angular momentum. Angular momentum was often illustrated by a spinning bicycle wheel.*

*The Raleigh Bicycle Company was founded in 1885 in Nottingham, England. Here are three pieces of old Raleigh advertising that are interesting but somewhat difficult to figure out. Left: Why the jet? Why the ominous sky? Center: Why . . . all of it? Right: What is she doing and what does it have to do with bicycles?*

True words. The results are usually better, and easier to get, without momentum. Nevertheless, we try again and again.

Maybe The Ultimate Optimizer can help.

https://openreview.net/pdf?id=-Qp-3L-5ZdI (credit Jörn Loviscach)

It’s hard to say what makes for good training. But nothing seems to replace experience.