The Maclaurin Series and Machine Learning

In the very early days of computers (say the 1950s and 1960s), most people who entered the new field of “computer science” came from a background in either mathematics or electrical engineering. There’s always been a strong connection between mathematics and computer science. More specifically, with regard to machine learning, every now and then I’ll have a brief micro-discussion with colleagues about what math topics people who are new to ML should know.

For me, the answer isn’t large categories like “vector algebra”. I prefer to think in terms of very discrete, small topics. Understanding the Maclaurin Series is one mini-topic I think every person who works with ML should know.

The Maclaurin Series is a special case of the Taylor Series. Both are equations that can approximate a mathematical function. The practical point is that in some ML scenarios, working directly with some function f(x) is very difficult, but working with the Maclaurin approximation to f(x), call the approximation P(x), is easier. The Maclaurin and Taylor series expansions pop up in several areas of ML, notably numerical optimization for ML training algorithms.

The Maclaurin approximation has a beautifully symmetric definition that uses the first, second, third, and so on, derivatives, and also the factorial function.

I prefer the full form of the approximation equation, but the simplified form, which uses the facts that 0! = 1! = 1, and x^0 = 1, and x^1 = x, is more common.
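In symbols, with P(x) denoting the approximation to f(x), the full form is:

P(x) = f(0)/0! * x^0 + f'(0)/1! * x^1 + f''(0)/2! * x^2 + f'''(0)/3! * x^3 + . . .

and the simplified form is:

P(x) = f(0) + f'(0) * x + f''(0)/2! * x^2 + f'''(0)/3! * x^3 + . . .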

Here’s an example of approximating f(x) = (x+1)^(-1/2) using a second order Maclaurin series:
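For this f(x), the derivatives at x = 0 are f(0) = 1, then f'(x) = -(1/2)(x+1)^(-3/2) so f'(0) = -1/2, and f''(x) = (3/4)(x+1)^(-5/2) so f''(0) = 3/4. Plugging into the simplified form gives:

P(x) = 1 - (1/2) * x + (3/8) * x^2

As a quick check, at x = 0.2 the true value is f(0.2) = (1.2)^(-1/2) = 0.9129 and the approximation gives P(0.2) = 1 - 0.1 + 0.015 = 0.9150.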

The approximation could be improved by adding more terms to the series expansion. And the approximation is only good for values close to x = 0. The Taylor Series generalizes the Maclaurin Series by using derivatives at any arbitrary value x = c.
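For reference, the general Taylor form expanded about an arbitrary point x = c is:

P(x) = f(c) + f'(c) * (x-c) + f''(c)/2! * (x-c)^2 + f'''(c)/3! * (x-c)^3 + . . .

Setting c = 0 gives back the Maclaurin Series.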

Is the explanation I’ve provided enough Maclaurin Series knowledge for engineers who are learning ML? It depends, but I’d say the information in this blog post is a minimal, but valuable, amount of knowledge about the Maclaurin Series approximation. Note that in my post, I did not explain where the Maclaurin Series approximation equation comes from — the derivation is very, very beautiful and would be more or less required information for a math major, but probably a bit of overkill for most engineers who work with ML.

Posted in Machine Learning | Leave a comment

Neural Network Momentum using Python

I wrote an article titled “Neural Network Momentum using Python” in the August 2017 issue of Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2017/08/01/neural-network-momentum.aspx

Momentum is a technique intended to speed up neural network training. Training a neural network is the process of determining the values of the weights and biases that essentially define the behavior of the network. The most common training algorithm is called back-propagation. Back-propagation is an iterative process which can take a very long time for complex neural networks.

The basic update for one weight is w = w + (-1 * lr * grad(w)). Put a bit differently:

delta = -1 * lr * grad(w)
w = w + delta

In words, the new weight value is the old value plus -1 times a small learning rate constant times the current gradient value of the weight. The learning rate is a small constant, perhaps 0.01, but is determined by trial and error. The gradient is the Calculus derivative (just a number like -2.34) where the sign tells you if the weight needs to increase or decrease and the magnitude influences how much the weight changes in one update.

Adding momentum is very easy. The update becomes:

delta = -1 * lr * grad(w)
w = w + delta + (mf * prev(delta))

In each weight update you add an additional term which is a momentum factor constant (typically something like 0.50) times the value of the delta from the previous update iteration.
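To make the mechanics concrete, here is a minimal Python sketch that uses the momentum update to minimize a dummy function f(w) = w^2, whose gradient is 2w. This just illustrates the update rule, not the article’s full neural network demo:

lr = 0.10          # learning rate
mf = 0.50          # momentum factor
w = 5.0            # the weight being trained
prev_delta = 0.0   # delta from the previous iteration

for i in range(20):
    grad = 2 * w                       # gradient of f(w) = w^2
    delta = -1 * lr * grad             # basic update step
    w = w + delta + (mf * prev_delta)  # momentum adds a fraction of the previous delta
    prev_delta = delta

print(w)  # very close to the minimum at w = 0.0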

In my article I go through the details of neural network momentum and give a complete demo program, written in Python, from scratch.

(See https://en.wikipedia.org/wiki/Momentum_(2015_film) )

Posted in Machine Learning | Leave a comment

Replicator Neural Networks

A standard neural network classifier builds a model that predicts output values from input values. For example, the famous Iris Data has 150 items. Each item has four predictor variables (sepal length, sepal width, petal length, petal width) followed by one of three species to predict: setosa encoded as (1,0,0), versicolor encoded as (0,1,0), and virginica encoded as (0,0,1). The first item in the set is:

5.1, 3.5, 1.4, 0.2, 1, 0, 0

You train the neural classifier to find the defining weight constants so that given an input set of four values, the model correctly predicts the species.

A replicator neural network builds a model that predicts its own inputs. This sounds strange at first, but I’ll explain the point shortly. For the Iris Data, you’d take the data for one of the three species (say, setosa) and remove the encoded labels. The idea is to feed the replicator NN the four inputs and have the model spit back the same four values. For example, conceptually, the first line of a training data file would be:

5.1, 3.5, 1.4, 0.2, 5.1, 3.5, 1.4, 0.2

The first four values act as inputs and the next four values act as the targets. Even though you could explicitly duplicate the values in the data file, there’s no need to do so because you can duplicate them programmatically, since inputs and targets are the same.


So, what’s the point? A replicator neural network can be used for anomaly detection. For example, if the data is some sort of network packet data, then you have tons of “normal” data. You create a replicator NN. Now when new data comes in, you pass it to the replicator. If the replicator NN doesn’t predict the packet data closely enough (defining what that means is the hard part), then the incoming packet might be malicious.

I coded up a short demo using raw Python. Good fun!
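The heart of such a demo is a small autoencoder-style network trained so that outputs match inputs. Here is a highly simplified sketch of the idea in raw Python with NumPy (not my actual demo code, and with crude made-up scaling):

import numpy as np
np.random.seed(0)

# three normalized setosa items (sepal length, sepal width, petal length, petal width)
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [4.7, 3.2, 1.3, 0.2]]) / 10.0  # crude scaling for illustration

W1 = np.random.normal(0, 0.1, (4, 2))  # a 4-2-4 replicator architecture
b1 = np.zeros(2)
W2 = np.random.normal(0, 0.1, (2, 4))
b2 = np.zeros(4)
lr = 0.5  # learning rate

for epoch in range(5000):
    H = np.tanh(X @ W1 + b1)  # hidden activations
    O = H @ W2 + b2           # linear output layer
    err = O - X               # the targets are the inputs themselves
    # back-propagation of squared error
    gW2 = H.T @ err / len(X)
    gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1.0 - H * H)  # tanh derivative
    gW1 = X.T @ dH / len(X)
    gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# reconstruction error acts as an anomaly score for a new item
x_new = np.array([5.0, 3.4, 1.5, 0.2]) / 10.0
recon = np.tanh(x_new @ W1 + b1) @ W2 + b2
print(np.sum((recon - x_new) ** 2))  # small = normal, large = possible anomaly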

The moral of the story is that getting and using training data that is labeled (called supervised training) — and so has known correct output values — is time-consuming and difficult. Replicator NNs are an example of a machine learning technique that doesn’t need labeled data (unsupervised training).

Posted in Machine Learning | 1 Comment

Deal and Reveal Blackjack Again

Many of the technical conferences I speak at are in Las Vegas. Vegas is a great town for conferences because, well, the town is basically designed to accommodate thousands of people. Hotel rates in Vegas are very reasonable, air travel is easy and relatively inexpensive, and there’s lots to do if you enjoy observing people and mathematics like I do.

When I’m at an event in Vegas, I usually try to get away for an hour or two and cruise through the casino gambling areas. It’s not uncommon for me to see a new game — Vegas is relentlessly trying to find new ways to separate visitors from their money. There are many companies that design new games and then showcase the games at one of the two big casino conferences (the Global Gaming Expo, and the Table Games Conference).

Of the dozens and dozens of new games invented each year, only about two or three ever make it into a casino for a trial run of a few months so that the Nevada Gaming Commission and the casinos are satisfied that the new game makes money (casinos) but not too much money (Gaming Commission).

(See https://en.wikipedia.org/wiki/The_Card_Players )

While I was in Vegas for a conference recently, I walked through the Palazzo casino (connected to the Venetian, where my conference was held) and I noticed a table game I hadn’t seen in several months. It’s a variation of Blackjack called “Deal & Reveal”. Briefly, the game is much like regular Blackjack. Recall that you (the player) bet (say $25) and get two cards. The dealer gets two cards, one face down and one face up, so you know one of her cards. In Deal & Reveal, if the dealer’s up card is a 2, 3, 4, 5, or 6, then before you decide to hit or stand, she turns over her down card so you can see both cards! If the dealer’s up card is 7, 8, 9, 10, J, Q, or K, then she doesn’t do anything. I’ll explain when the dealer’s up card is an Ace in a moment. It would seem that this gives the player a big advantage, but surprisingly, seeing both of the dealer’s cards helps you a lot less than you’d expect.

An interesting detail is that when the dealer’s up card is an Ace, the dealer immediately checks to see if her down card is any ten, meaning she has Blackjack. Normally you’d lose (omitting the detail of Insurance) but in Deal & Reveal, if the dealer’s down card is any ten, she discards it and you get a second chance. This is psychologically very powerful, but again, mathematically it doesn’t help you as much as you’d think.

I’ve left out several important details. You can look the game up on the Internet or click on the image of the Rule Card I picked up to enlarge it so you can read it.


The moral here is for me only: My love of combinatorial math, probability, and computer science was ignited in part by my love of games such as poker and chess when I was young. Las Vegas is an intriguing place for me because of the math and the psychology. I do have some minor qualms about the ethics of gambling, but I think I’m over-sensitive to those kinds of issues. I have more fun analyzing the games than actually playing them. Usually.

Posted in Conferences, Miscellaneous | Leave a comment

Time Series Regression using a Raw Python Neural Network

I’ve been looking at time series regression recently. Just for fun I coded up an example using a raw Python (with the NumPy library for numerical functions) neural network. For my example I used a standard benchmark data set that has the total number of airline passengers for the 144 months from January 1949 through December 1960.


I used a rolling window approach, with a window size of 4. This means that I used each set of four consecutive months to predict the next month. I normalized the raw data by dividing each passenger count by 100,000, so the first data item is (1.12, 1.18, 1.32, 1.29, 1.21), meaning that in months 1-4 there were 112,000, 118,000, 132,000, and 129,000 passengers. Those values are used to predict the passenger count for month 5, which is 121,000. The second item is (1.18, 1.32, 1.29, 1.21, 1.35) — the counts for months 2-5 are used to predict the count for month 6.
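Building the rolling-window data takes only a few lines of Python. A sketch, using just the first several months of the normalized data:

import numpy as np

# first eight normalized monthly counts (actual passenger counts / 100,000)
passengers = np.array([1.12, 1.18, 1.32, 1.29, 1.21, 1.35, 1.48, 1.48])

window = 4  # four consecutive months predict the next month
X = np.array([passengers[i:i+window] for i in range(len(passengers) - window)])
y = passengers[window:]
print(X[0], y[0])  # [1.12 1.18 1.32 1.29] 1.21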

After I created my prediction model, I used it to print out the actual and predicted passenger counts. I dropped that data into Excel and made a graph. The model worked pretty well. Time series regression can be extremely complicated, but this was an interesting little exercise.

Posted in Machine Learning | Leave a comment

My Four Most Common Python NumPy Array Initializations

I use several different programming languages. Whenever I switch between languages, there’s always an adjustment time in my head. For some reason, whenever I switch from C# to Python with NumPy, it always takes me about an hour to start thinking fully in Python. In particular, it always takes me time to recall Python/NumPy array initializations.

One of the causes of this is that C# has basically two ways to instantiate an array:

double[] arr1 = new double[4];
double[] arr2 = new double[] { 1.0, 5.0, 2.0 };

But Python NumPy has many ways to instantiate an array. The ones I use most often are np.zeros(), np.array(), np.full(), and a return from np.random.choice(). For example:

arr1 = np.zeros(shape=5, dtype=np.float32) # 5 0.0 cells
arr2 = np.array([17,2,5,0,5,12], dtype=np.int64)
arr3 = np.array(range(0,5), dtype=np.int64) # [0,1,2,3,4]
arr4 = np.full(shape=3, fill_value=0.01, dtype=np.float64)
arr5 = np.random.choice(7, 2) # 2 random ints in [0,6]
mat1 = np.zeros(shape=(2,3), dtype=np.float32) # 2x3 matrix

There are several implications. One is that I prefer programming languages that have sparse feature sets — I prefer to know everything about a small language. For example, the np.zeros() function is redundant in a sense because you can get the same effect using np.full() with fill_value=0.0.
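For example, these two calls produce identical arrays:

a = np.zeros(shape=5, dtype=np.float64)
b = np.full(shape=5, fill_value=0.0, dtype=np.float64)
print(np.array_equal(a, b))  # True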

Another implication is the high cost of context switching when programming. It costs time and effort to switch languages (for example, working with C# on Monday, Wednesday, Friday, and with Python on Tuesday, Thursday). Or doing programming from 9:00 AM to 11:00 AM, then switching over to email tasks, then switching back to programming.

Posted in Machine Learning | Leave a comment

I do an Interview about Machine Learning on Microsoft’s Channel 9

Channel 9 is a Microsoft community video Web site. There are all kinds of interesting videos on Channel 9, but most of the videos are aimed at software developers.

I was recently asked to do a short (6-minute) interview on Channel 9. The topic was machine learning and the upcoming DevIntersection conference where I’ll be speaking about the Microsoft CNTK code library. See https://channel9.msdn.com/Shows/The-DEVintersection-Countdown-Show/DEVintersection-Countdown-Show-on-the-Opportunities-in-Machine-Learning-with-James-McCaffrey

The interview host was Richard Campbell. I’ve known Richard for a long time because we’ve both spoken at Microsoft conferences for quite a few years. Richard is a very bright guy, and as much as anyone I know, he has a really good understanding of the big picture of software development and technology. And he’s very articulate too — a relatively rare characteristic for deep technical experts.

Anyway, we chatted and I explained the differences between data science, machine learning, deep learning, and artificial intelligence. The video interview recording session took place at the Microsoft Production Studios in Building 25. The studios there are quite impressive and very professional.

If you go to the DevIntersection Conference, October 31 through November 2, 2017, please seek me out before or after my CNTK talk and say “hello”. See the conference site at: https://www.devintersection.com.

Posted in Conferences, Machine Learning | Leave a comment