Neural Network Nesterov Momentum

There are several topics related to neural network implementation that are the source of much confusion and incorrect information. Nesterov momentum (also called Nesterov Accelerated Gradient) is one such topic.

I was preparing to give a talk about neural network momentum, so I did a quick review of the Internet to see what common developer sites such as Stack Overflow had to say about Nesterov momentum. I was not terribly surprised to find a lot of misleading, and in many cases, just completely inaccurate information. I wasn’t surprised because Nesterov momentum is simple in principle, but extremely tricky in the details.

A full explanation of Nesterov momentum would take many pages, so I’ll try to be brief at the expense of 100% correctness. When training a NN, on each iteration, you compute a delta for each weight. The standard delta is minus one times the gradient. With regular momentum you add an additional term equal to a constant (the momentum constant, typically something like 0.8) times the previous delta.

With Nesterov momentum, in theory you calculate the gradient not for the current weights, but rather for the current weights plus the momentum constant times the previous delta. This is a deep idea. Unfortunately, it’s also quite annoying to actually compute in practice.

So, there’s an alternative form of Nesterov momentum where the delta looks quite a bit different but is (almost) exactly the same mathematically. The alternative form uses just the gradient calculated for the current weights, which is much easier to compute.
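To make the three update rules concrete, here is a minimal Python sketch on a toy one-weight loss, f(w) = w^2. Everything in it is an assumption for illustration: the toy loss, the learning rate lr (which the description above leaves out for brevity), and the names grad and train. The "alternative" branch is one commonly used reformulation, in which the stored weight is the look-ahead weight, and it may not be exactly the form used in any particular library.

```python
def grad(w):
    # Gradient of the toy loss f(w) = w^2 (an assumed example loss).
    return 2.0 * w

def train(mode, w=5.0, lr=0.1, mu=0.8, steps=200):
    # mu is the momentum constant; lr is an assumed learning rate.
    delta = 0.0  # previous delta, initially zero
    for _ in range(steps):
        if mode == "regular":
            # gradient at the current weight, plus mu times previous delta
            delta = mu * delta - lr * grad(w)
            w += delta
        elif mode == "nesterov":
            # gradient at the current weight plus mu times previous delta
            delta = mu * delta - lr * grad(w + mu * delta)
            w += delta
        elif mode == "alternative":
            # gradient at the current (stored) weight only; here the
            # stored weight plays the role of the look-ahead weight
            prev = delta
            delta = mu * delta - lr * grad(w)
            w += -mu * prev + (1.0 + mu) * delta
        else:
            raise ValueError("unknown mode: " + mode)
    return w

for mode in ("regular", "nesterov", "alternative"):
    print(mode, train(mode))
```

All three variants drive the weight toward the minimum at 0 on this toy problem. The sketch only shows how the deltas differ; it says nothing about which variant works best on a real network.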

Anyway, there are a couple of morals to the story. First, with neural networks, everything is tricky. Second, there’s a somewhat surprising amount of incorrect information on the Internet about implementing neural networks — you really need to go to the original research papers.

Posted in Machine Learning | Leave a comment

My Top Ten Favorite Semi-Obscure Super Hero Comic Book Titles of the 1960s

I’ve always had a special fondness for comic books of the 1960s, the so-called Silver Age. Everyone knows Superman and Spider-Man, and so on. But there were some interesting super hero comic book titles that aren’t as well known. Here are my 10 favorite lesser-known titles that you might not have heard of.

1. Adam Strange – Adam Strange was an archeologist who was accidentally teleported to the planet Rann. He had no built-in super powers and relied on his wits and Rannian technology to defeat enemies. The stories tended to be a bit more scientific than other titles of the 60s.


2. Ant-Man – Ant-Man is now fairly well-known because of the 2015 movie. The comic character first appeared in 1962. Scientist Hank Pym invents a way to shrink his size. The character became Giant-Man in 1963 but I always preferred Ant-Man.


3. The Fly – The Fly was an Archie Comics publishing character who first appeared in 1958. Tommy Troy turned into The Fly using a magic ring and could walk on walls, had great vision, and was very agile. Of all the possible insects to emulate, I always wondered why a fly?


4. Metal Men – The Metal Men were six robots created by scientist William Magnus. Each robot was made of a different metal: Gold (the leader), Iron (strong), Lead (loyal but dim), Mercury (hot-tempered), Tin (shy), and the female Platinum (flirtatious). Very unusual series with distinctive, vaguely psychedelic artwork.


5. The Atom – The Atom first appeared in 1961. Scientist Ray Palmer uses white dwarf star material to create a shrinking ray. The plots usually involved some situation where The Atom battled aliens, mad scientists, or common criminals.


6. Turok – Turok and his pal Andar are ordinary Native American Indians who get trapped in a valley with dinosaurs. Published first by Dell and then by the spin-off Gold Key Comics, this title had great artwork and fantastic cover art.


7. Hawkman – The Silver Age Hawkman first appeared in 1964. He was Katar Hol, an alien policeman from the planet Thanagar. Hawkman was never that popular and appeared mostly as a secondary character in other titles.


8. The Martian Manhunter – He was J’onn J’onzz from the planet Mars, who was accidentally teleported to Earth. He had super powers that more or less varied as plot lines required.


9. Solar – He was Doctor Solar, Man of the Atom. A Gold Key publication with very nice artwork. Physicist Dr. Phillip Solar survives a nuclear accident that gives his body weird radiation powers. Radiation, even back in the 60s, was a scary topic and reading Solar always makes me slightly uneasy, but I enjoy the title.


10. The Jaguar – An Archie Comics super hero. He is zoologist Ralph Hardy. He gains the powers of a jaguar when he wears a mystical nucleon energy belt. There were only 15 issues of “The Adventures of the Jaguar” although he appeared as a guest in other titles too. The Jaguar feels very old-fashioned and simplistic, but that is its charm to me.


Posted in Top Ten | Leave a comment

Neural Network Cross Entropy Error using Python

I wrote an article in the July 2017 issue of Visual Studio Magazine titled “Neural Network Cross Entropy Error using Python”. See

For beginners to neural networks, cross entropy error (also called “log loss”) can be very confusing. Cross entropy error is actually quite simple, but like many topics in mathematics, there are many, many ways to look at CE error and so there are many, many different explanations. At first these explanations seem very different. In fact they are mathematically equivalent, but it takes a lot of time to understand how they relate.

In the early days of neural networks, the nearly universal technique used to compare computed output values to desired target values (from training data) was mean squared error (MS error). For example, suppose for a given set of input values and current NN weight values, the computed output values are (0.20, 0.70, 0.10). If the target values are (0, 1, 0) then squared error is (0.20 - 0)^2 + (0.70 - 1)^2 + (0.10 - 0)^2 = 0.04 + 0.09 + 0.01 = 0.14. If you computed squared error for all the training items and then took the average, you’d have mean squared error.

Cross entropy error for the same data as above would be -[ln(0.20)*0 + ln(0.70)*1 + ln(0.10)*0] = -(0 + (-0.36) + 0) = 0.36. Notice that for neural network classification, because target values will have just one 1-value and all the rest 0-values, only one term doesn’t drop out of the calculation.
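The two calculations above are easy to check in a few lines of Python. This is just a sketch for the single example item; the variable names are my own and are not taken from the magazine article.

```python
import math

computed = [0.20, 0.70, 0.10]  # computed output values from the example
targets = [0, 1, 0]            # target values from the example

# squared error: sum of squared differences for this one item
squared_error = sum((c - t) ** 2 for c, t in zip(computed, targets))

# cross entropy error: minus the sum of ln(computed) * target
cross_entropy = -sum(t * math.log(c) for c, t in zip(computed, targets))

print(round(squared_error, 2))   # 0.14
print(round(cross_entropy, 2))   # 0.36
```

Because the targets have a single 1-value, the cross entropy sum reduces to -ln(0.70).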

In my article I explain how to use cross entropy error with neural network back-propagation training. As it turns out, using cross entropy error usually leads to better results than mean squared error (the explanation of why is too long for this blog post), and so CE error is now the default error measurement used for neural networks.

The moral is that, if you’re a beginner to NNs, the amount of detail can seem overwhelming at first. But there are only a finite number of things that are essential. Understanding cross entropy error is one of those essential topics.

Posted in Machine Learning | Leave a comment

Moments and Machine Learning

Recently, I had an interesting discussion with some colleagues about what are essential mathematics topics for machine learning engineering. Every now and then, in machine learning literature, the terms “first moment” and “second moment” will pop up. If you are an engineer and don’t know what these terms mean, you won’t understand the article.

In short, the first moment of a set of numbers is just the mean (that is, the average) and the second moment is usually just the variance. However, by themselves, the terms “first moment” and “second moment” are ambiguous. Let me explain.

The Moment

Suppose you have four numbers (x0, x1, x2, x3). The first raw moment is (x0^1 + x1^1 + x2^1 + x3^1) / 4 which is nothing more than the average. For example, if your four numbers are (2, 3, 6, 9) then the first raw moment is (2^1 + 3^1 + 6^1 + 9^1) / 4 = (2 + 3 + 6 + 9) / 4 = 20/4 = 5.0.

In words, to compute the raw first moment of a set of numbers, you raise each number to 1 (which has no effect), sum, then divide by the number of numbers.

The second raw moment of a set of numbers is just like the first moment, except that instead of raising each number to 1, you raise to 2 (i.e., square). Put another way, the second raw moment of four numbers is (x0^2 + x1^2 + x2^2 + x3^2) / 4. For (2, 3, 6, 9) the second raw moment is (2^2 + 3^2 + 6^2 + 9^2) / 4 = (4 + 9 + 36 + 81) / 4 = 130/4 = 32.5.

There’s also a raw third moment (raise each number to 3), and raw fourth moment (raise each number to 4), and so on.

But. In mathematics, there’s always a “but”. In addition to the first and second raw moments, there’s also a central moment where before raising to a power, you subtract the mean. For example, the second central moment of four numbers is [(x0-m)^2 + (x1-m)^2 + (x2-m)^2 + (x3-m)^2] / 4. For (2, 3, 6, 9), the second central moment is [(2-5)^2 + (3-5)^2 + (6-5)^2 + (9-5)^2] / 4 = (9 + 4 + 1 + 16) / 4 = 30/4 = 7.5 which is the population variance of the four numbers.

The first central moment of a set of numbers is, weirdly, always 0. For the four example numbers, the first central moment is [(2-5)^1 + (3-5)^1 + (6-5)^1 + (9-5)^1] / 4 = (-3 + -2 + 1 + 4) / 4 = 0/4 = 0.
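The raw and central moments worked out above can be reproduced with two small Python functions. This is a minimal sketch; the function names raw_moment and central_moment are my own choices for illustration.

```python
def raw_moment(xs, k):
    # k-th raw moment: raise each number to k, sum, divide by the count
    return sum(x ** k for x in xs) / len(xs)

def central_moment(xs, k):
    # k-th central moment: subtract the mean before raising to k
    m = raw_moment(xs, 1)
    return sum((x - m) ** k for x in xs) / len(xs)

data = [2, 3, 6, 9]
print(raw_moment(data, 1))      # 5.0  (the mean)
print(raw_moment(data, 2))      # 32.5
print(central_moment(data, 1))  # 0.0
print(central_moment(data, 2))  # 7.5  (the population variance)
```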

To summarize, in machine learning, the term “first moment” often means the “first raw moment” (which is the mean) and the term “second moment” often means “the second central moment”, which is the variance. But not always. For example, the Adam optimization algorithm uses a first and second moment, but both moments are raw. When reading an article, and the difference matters, you need to ask which moment, raw or central, the author/person means.

Final Moment

Wrong Moment

Awkward Moment

Posted in Machine Learning | Leave a comment

Coding an Adam Optimization Algorithm Demo

Adam is a machine learning optimization algorithm. “Adam” is not an acronym strictly speaking (which is why it’s not capitalized) but it stands for “adaptive moment estimation”.

Adam was first published in July 2015 (24 months ago as I write this post) and has quickly become one of the main algorithms used for neural network training. Things are moving very fast in the field of machine learning.

The only way I can completely understand an algorithm is if I can implement the algorithm in code. So this morning, while on a break from speaking at a conference I’m attending, I fired up Visual Studio and tackled an Adam demo.

I kept things simple and attempted to implement Adam to find the minimum of a dummy loss function w0^2 + w1^2, which is sometimes called the sphere function. The solution is w0 = w1 = 0.

I located the source research paper. Compared to many research papers, the Adam paper is very well written and I didn’t have too much difficulty understanding it – but I’ve been reading such papers for many years so it makes sense that I’d understand the paper.

Anyway, after a couple of hours, I had a demo up and running. There’s still a lot about the Adam algorithm I don’t understand yet, but coding up a demo is a big first step towards full understanding.
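The demo code itself isn't shown in this post, so here is a minimal Python sketch of what an Adam demo on the sphere function might look like. It is not my original demo. The values beta1 = 0.9, beta2 = 0.999 and eps = 1e-8 are the defaults suggested in the Adam paper; the learning rate of 0.01, the starting point, the iteration count, and the function names are assumptions picked for illustration.

```python
import math

def sphere_grad(w):
    # Gradient of the loss w0^2 + w1^2 is (2*w0, 2*w1).
    return [2.0 * wi for wi in w]

def adam_minimize(w, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    w = list(w)
    m = [0.0] * len(w)  # running estimate of the first raw moment of the gradient
    v = [0.0] * len(w)  # running estimate of the second raw moment of the gradient
    for t in range(1, steps + 1):
        g = sphere_grad(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w = adam_minimize([1.0, -1.5])  # assumed starting weights
print(w)  # both values should end up close to the solution w0 = w1 = 0
```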

Posted in Machine Learning | Leave a comment

One of my Favorite Logic Puzzles

Here’s one of my favorite logic puzzles.

Two friends, a programmer and a mathematician, get together for drinks after work one day at the programmer’s house. The mathematician asks the programmer how his three children are doing. The programmer replies that one of his three children just had a birthday.

The mathematician then asks, “How old are your children now?” The programmer answers, “The product of their ages is 36.” The mathematician thinks for a moment and says, “That’s not enough information.” The programmer says, “OK then, the sum of their ages equals my house street address number.” The mathematician steps outside to check the address number, comes back inside, and says, “That’s still not enough information.” The programmer then says, “Well my oldest child has red hair.”

The mathematician immediately responded, “Oh, now I know the ages!” and told the programmer what the ages were. What are the ages of the programmer’s three children?

The children are aged 2, 2, and 9 years old (there are two-year-old twins). The mathematician’s logic was that since there are three children whose ages multiply to 36, the eight possible combinations are:

1, 1, 36
1, 2, 18
1, 3, 12
1, 4, 9
1, 6, 6
2, 2, 9
2, 3, 6
3, 3, 4

Initially, any of these combinations could be the correct ages. After the programmer says that the sum of the ages is the same as the house address, the mathematician mentally computed the sum of each possible combination:

1, 1, 36  sum = 38
1, 2, 18  sum = 21
1, 3, 12  sum = 16
1, 4, 9   sum = 14
1, 6, 6   sum = 13
2, 2, 9   sum = 13
2, 3, 6   sum = 11
3, 3, 4   sum = 10

Notice that all the sums are different except for (1, 6, 6) and (2, 2, 9) which both sum to 13. If the programmer’s address had been anything other than 13, the mathematician would have known the ages as soon as he checked it. Because he still didn’t have enough information, the three ages must be one of the two combinations that sum to 13. After the programmer said that the oldest child has red hair, the mathematician knew that there was a single oldest child, which eliminates the (1, 6, 6) combination with its oldest twins, leaving just (2, 2, 9) as the ages of the children.
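The mathematician’s reasoning can also be checked by brute force. The Python sketch below is my own illustration (the variable names are arbitrary). It enumerates the age combinations, keeps the ones whose sum is shared with another combination, and then keeps the ones with a single oldest child.

```python
from collections import defaultdict
from itertools import combinations_with_replacement

# all non-decreasing triples of whole-number ages whose product is 36
triples = [t for t in combinations_with_replacement(range(1, 37), 3)
           if t[0] * t[1] * t[2] == 36]

# group the triples by their sum (the house address number)
by_sum = defaultdict(list)
for t in triples:
    by_sum[sum(t)].append(t)

# the address did not settle it, so the sum must be shared by 2+ triples
ambiguous = [t for ts in by_sum.values() if len(ts) > 1 for t in ts]

# "my oldest child" means exactly one child is the oldest
answer = [t for t in ambiguous if t[2] > t[1]]

print(len(triples))  # 8
print(ambiguous)     # [(1, 6, 6), (2, 2, 9)]
print(answer)        # [(2, 2, 9)]
```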


Posted in Miscellaneous | 1 Comment

I Give a Talk about Neural Network Dropout

There are pros and cons about working at a huge company. One of the very best things about working at Microsoft is the research talks that happen every day on “resnet”. I gave a resnet talk recently on the topic of neural network dropout.

I spent quite a bit of time reviewing neural network fundamental concepts: the input-output mechanism, and the back-propagation training algorithm. Then I discussed the dropout technique where, as each training item is presented, a random 50% of the hidden nodes are selected and dropped as if they weren’t there.

This technique in effect samples sub-networks and then averages them together. The main idea is very simple, but like always with neural networks, there are many subtle details.

Also, when I gave my presentation, I tried to add peripheral information about the history and development of the technique, and a bit about the psychology that’s associated with machine learning research.

I gave the audience a few challenges. When the nodes to drop are selected, they’re always (in every example I’ve ever found anyway) selected randomly:

for-each hidden node
  generate a random probability p between 0 and 1
  if p < 0.50 make curr node a drop node
end for-each

But this approach doesn’t guarantee that exactly half of the hidden nodes will be selected. If you have four hidden nodes, you might get 0, 1, 2, 3, or 4 drop nodes. So the challenge was to write selection code that guarantees exactly half of the nodes are selected.

If using the Python language, one way to do this would be to use the random.sample() function. For example:

import random

print("\nBegin \n")
random.seed(0)  # make reproducible

indices = list(range(0,10))  
print(indices)  # [0, 1, . . 9]

selected = random.sample(indices, 5)
print(selected)  # 5 random indices

print("\nEnd \n")
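To put the two selection approaches side by side, here is a short Python sketch that wraps each one in a function. The function names and the use of a separate random.Random object are my own choices for illustration, not code from the talk.

```python
import random

def drop_by_coin_flip(n_hidden, rng, p=0.50):
    # Usual approach: each hidden node is dropped independently with
    # probability p, so the number of drop nodes varies from call to call.
    return [j for j in range(n_hidden) if rng.random() < p]

def drop_exactly_half(n_hidden, rng):
    # Alternative: sample without replacement, so exactly
    # n_hidden // 2 nodes are always dropped.
    return sorted(rng.sample(range(n_hidden), n_hidden // 2))

rng = random.Random(0)  # seeded to make the run reproducible
for _ in range(3):
    print(drop_by_coin_flip(4, rng), drop_exactly_half(4, rng))
```

With four hidden nodes, the first list can hold anywhere from 0 to 4 indices, while the second always holds exactly 2.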

I pointed out that, to the best of my knowledge, nobody has investigated and published an analysis of whether the two selection approaches give essentially the same results on neural network prediction accuracy.

Posted in Machine Learning | 1 Comment