Recently, I had an interesting discussion with some colleagues about which mathematics topics are essential for machine learning engineering. Every now and then, in machine learning literature, the terms “first moment” and “second moment” pop up. If you are an engineer and don’t know what these terms mean, you will have trouble following the article.

In short, the first moment of a set of numbers is just the mean (that is, the average) and the second moment is usually just the variance. However, by themselves, the terms “first moment” and “second moment” are ambiguous. Let me explain.

Suppose you have four numbers (x0, x1, x2, x3). The first raw moment is (x0^1 + x1^1 + x2^1 + x3^1) / 4 which is nothing more than the average. For example, if your four numbers are (2, 3, 6, 9) then the first raw moment is (2^1 + 3^1 + 6^1 + 9^1) / 4 = (2 + 3 + 6 + 9) / 4 = 20/4 = 5.0.

In words, to compute the first raw moment of a set of numbers, you raise each number to the power 1 (which has no effect), sum the results, then divide by how many numbers there are.

The second raw moment of a set of numbers is just like the first moment, except that instead of raising each number to 1, you raise to 2 (i.e., square). Put another way, the second raw moment of four numbers is (x0^2 + x1^2 + x2^2 + x3^2) / 4. For (2, 3, 6, 9) the second raw moment is (2^2 + 3^2 + 6^2 + 9^2) / 4 = (4 + 9 + 36 + 81) / 4 = 130/4 = 32.5.

There’s also a third raw moment (raise each number to the power 3), a fourth raw moment (raise each number to the power 4), and so on.
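The raw-moment recipe above can be sketched in a few lines of Python. This is a minimal illustration, not a library routine; the function name raw_moment and the variable names are my own choices for the demo.

```python
def raw_moment(xs, k):
    """Return the k-th raw moment: the mean of each value raised to the power k."""
    return sum(x ** k for x in xs) / len(xs)


data = [2, 3, 6, 9]

first_raw = raw_moment(data, 1)   # (2 + 3 + 6 + 9) / 4 = 5.0
second_raw = raw_moment(data, 2)  # (4 + 9 + 36 + 81) / 4 = 32.5
third_raw = raw_moment(data, 3)   # (8 + 27 + 216 + 729) / 4 = 245.0

print(first_raw, second_raw, third_raw)
```

The only thing that changes from one raw moment to the next is the exponent k.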

But. In mathematics, there’s always a “but”. In addition to the raw moments, there are also central moments, where you subtract the mean from each number before raising to a power. For example, the second central moment of four numbers with mean m is [(x0-m)^2 + (x1-m)^2 + (x2-m)^2 + (x3-m)^2] / 4. For (2, 3, 6, 9), where m = 5, the second central moment is [(2-5)^2 + (3-5)^2 + (6-5)^2 + (9-5)^2] / 4 = (9 + 4 + 1 + 16) / 4 = 30/4 = 7.5, which is the population variance of the four numbers.

The first central moment of a set of numbers is, weirdly, always 0, because the deviations above and below the mean cancel out. For the four example numbers, the first central moment is [(2-5)^1 + (3-5)^1 + (6-5)^1 + (9-5)^1] / 4 = (-3 + -2 + 1 + 4) / 4 = 0/4 = 0.
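Central moments can be sketched the same way. The function name central_moment is hypothetical, chosen for this demo. The comparison against statistics.pvariance from the Python standard library is just a sanity check that the second central moment matches the population variance.

```python
import statistics


def central_moment(xs, k):
    """Return the k-th central moment: subtract the mean, raise to k, then average."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)


data = [2, 3, 6, 9]

first_central = central_moment(data, 1)   # (-3 + -2 + 1 + 4) / 4 = 0.0
second_central = central_moment(data, 2)  # (9 + 4 + 1 + 16) / 4 = 7.5

# The second central moment is the population variance.
population_variance = statistics.pvariance(data)

# It also equals the second raw moment minus the squared mean: 32.5 - 5.0**2.
mean = sum(data) / len(data)
second_raw = sum(x ** 2 for x in data) / len(data)
via_raw_moments = second_raw - mean ** 2

print(first_central, second_central, population_variance, via_raw_moments)
```

The last check shows how the two kinds of second moment are related: for the example numbers, 32.5 - 25 = 7.5.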

To summarize, in machine learning, the term “first moment” often means the “first raw moment” (which is the mean) and the term “second moment” often means “the second central moment”, which is the variance. But not always. For example, the Adam optimization algorithm uses a first and second moment, but both moments are raw. When you are reading an article and the difference matters, you need to work out which moment, raw or central, the author means.
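To make the Adam remark concrete, here is a rough sketch of how Adam-style moment estimates are commonly described: exponential moving averages of the gradient and of the squared gradient, followed by a bias correction. This is a simplified illustration on a single scalar gradient, not a full optimizer. The function name adam_moment_estimates is hypothetical, and the defaults beta1 = 0.9 and beta2 = 0.999 are the commonly quoted values, assumed here.

```python
def adam_moment_estimates(grads, beta1=0.9, beta2=0.999):
    """Return bias-corrected estimates of the first and second raw moments
    of a sequence of scalar gradients, using exponential moving averages."""
    if not grads:
        raise ValueError("grads must contain at least one value")

    m = 0.0  # running estimate of the first raw moment (mean of g)
    v = 0.0  # running estimate of the second raw moment (mean of g squared)
    t = 0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g  # no mean is subtracted before squaring

    # Bias correction compensates for starting m and v at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat, v_hat


# A constant gradient of 2.0 has variance 0, but its second raw moment is 4.
m_hat, v_hat = adam_moment_estimates([2.0] * 10)
print(m_hat, v_hat)
```

With a constant gradient of 2.0, the second estimate comes out at approximately 4 (the mean of the squared values), not 0 (the variance). That is the sense in which Adam's second moment is raw rather than central.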