Using a CNTK LSTM Network with Word2Vec

I successfully implemented an LSTM network using CNTK with Word2Vec embeddings. Let me explain. I started with a paragraph of the Sherlock Holmes novel “A Study in Scarlet”. The first couple of sentences (converted to lower case, punctuation removed) are:

in the year 1878 i took my degree of doctor of medicine of the
university of london and proceeded to netley to go through the
course prescribed for surgeons in the army having completed my
studies there i was duly attached to the fifth northumberland
fusiliers as assistant surgeon

My goal was to create a prediction model — given N words, what is the next word? For example, if N = 4 and the input sequence is “year 1878 i took” then the model should predict “my”. First I converted all the words to index values: “in” = 0, “the” = 1, “year” = 2, and so on. Now in theory I could have used these index values directly, but a much better approach is to convert each word/index to a numeric vector of floating point values.

This approach is called an embedding. I used the Word2Vec tool to create the embeddings for each of the 86 distinct words in the source text. I set the vector length to 32. The result for “the” was

Vector for 'the' is:
[ 3.0290568e-03  1.1347506e-02  2.5496054e-03 -1.3096497e-02
 -5.7233768e-03  9.1301277e-03 -2.6647178e-03  1.2957667e-02
 -3.7651435e-03 -1.0592117e-02 -6.0152885e-05  8.1940945e-03
 -1.1889883e-02 -1.5280096e-02  4.6902723e-03 -1.0119098e-02
 -1.0269336e-02 -9.8525938e-03 -8.9324228e-03  1.3820899e-02
  8.8472795e-03 -1.0620472e-02  1.3961374e-03  1.3016418e-02
 -9.3864333e-03 -1.1885420e-02  7.3955222e-03  1.3285194e-02
  1.1789358e-02  8.3396314e-03 -8.4532667e-03 -4.6083345e-03]
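
To make this step concrete, here is a minimal sketch of creating the embeddings with the gensim library. This is an assumption on my part (the post doesn't say which Word2Vec implementation was used), and the exact vector values depend on the random seed; only the vector length of 32 matches the description above.

# Minimal sketch (assumes gensim 4.x) of building 32-value Word2Vec embeddings
# from the lower-cased, punctuation-free paragraph. Not the post's exact code.
from gensim.models import Word2Vec

text = ("in the year 1878 i took my degree of doctor of medicine of the "
        "university of london and proceeded to netley to go through the "
        "course prescribed for surgeons in the army having completed my "
        "studies there i was duly attached to the fifth northumberland "
        "fusiliers as assistant surgeon")

sentences = [text.split()]                     # gensim expects a list of token lists
model = Word2Vec(sentences, vector_size=32, min_count=1, seed=1)

print("Vector for 'the' is:")
print(model.wv['the'])                         # a 32-value numpy array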

Next I created a data file for a CNTK network. The data file looked like:

0 |curr -0.86816233  0.28763667 . .  1.50366807 |next 4:1
0 |curr  0.30290568  1.13475062 . . -0.46083345 
0 |curr -0.65285438 -0.69098999 . .  1.46716731
. . . 

By the way, figuring out each of these steps was rather difficult and each took several days of work. I was stuck on the CNTK sequence format until I got some valuable information from a colleague, William Darling. Without that key information, I’d still be stuck.

I’m leaving out tons of details. For example, CNTK has a built-in Embedding layer you can use instead of Word2Vec embeddings, and that built-in layer can even accept a text file of Word2Vec vector values.

With my data ready at last, I ran a program to train the model. It failed spectacularly until I noticed that the Word2Vec vector values were very small (like 0.0001234), so I scaled them up by multiplying by 100 (see the data snippet above).
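
For what it's worth, here is a rough sketch of the kind of script that can produce data lines in the format shown above. It is not my exact code; the variable names (model, words, word_to_idx) and the output file name are assumptions, and the x100 scaling is the adjustment just described.

# Hypothetical sketch of writing the CNTK text-format data file shown above.
# Assumes 'model' is a trained Word2Vec model, 'words' is the tokenized paragraph,
# and 'word_to_idx' maps each of the 86 distinct words to its index.
N = 4           # number of input words per sequence
SCALE = 100.0   # scale up the tiny Word2Vec values

with open('sherlock_cntk.txt', 'w') as f:
    for seq_id in range(len(words) - N):
        for t in range(N):
            vec = model.wv[words[seq_id + t]] * SCALE
            vals = ' '.join('%0.8f' % v for v in vec)
            line = '%d |curr %s' % (seq_id, vals)
            if t == 0:  # the target word goes (sparse, one-hot) on the first line of the sequence
                line += ' |next %d:1' % word_to_idx[words[seq_id + N]]
            f.write(line + '\n')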

Finally, after weeks of work, I was able to create an LSTM network model of the first paragraph of “A Study in Scarlet” using Word2Vec embeddings.

Check out William’s excellent video about machine learning for sequences at

Posted in CNTK, Machine Learning | Leave a comment

Establishing Baseline Accuracy for a Time Series Regression Problem

The goal of a time series regression problem is to predict the next value given a sequence of input values. A typical example would be predicting a company’s next-month sales figure from the previous three months’ sales figures.

Time series regression problems are easy to understand but they are among the most difficult types of problems in all of machine learning. There are many different techniques you can use for time series, and just the fact that there are so many different techniques is an indication of how difficult these problems are.

When you create a time series regression model, it’s up to you to define model accuracy during training. Put another way, you must define what it means for a prediction to be correct — how close is close enough to be considered correct? The most common approach is to pick a threshold percentage, such as 10%, and then count a prediction as correct if it is within plus or minus that percentage of the actual value.

So suppose you create a time series regression model and, using a 10% threshold, it has 85% prediction accuracy. To interpret what this means, you need a baseline accuracy to compare against. The easiest way to establish a baseline accuracy for a prediction model is to use the simplest possible prediction strategy: predict that the next number in the series is equal to the current number.

For example, the well-known Airline Passengers dataset looks like:

. . .

The first value means that there were 112,000 airline passengers in January 1949. To compute baseline accuracy with a threshold of 10%, you walk through the data and imagine that the predicted value is the current value, and the actual value is the next value:

predicted  actual  diff  correct?
112        118       6    yes
118        132      14     no
132        129       3    yes
. . .

Using this technique, the baseline accuracy with a 10% threshold is 51.05% accuracy (73 correct, 70 wrong). Therefore any prediction model that has better than 51% accuracy is an improvement on the baseline prediction technique.
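
In code, the baseline calculation takes just a few lines. Here is a minimal Python sketch, assuming the 144 monthly values are stored in a list and that the threshold is taken relative to the actual value (function and variable names are mine):

# Naive baseline: predict that the next value equals the current value.
def baseline_accuracy(series, threshold=0.10):
    correct = 0
    for i in range(len(series) - 1):
        predicted, actual = series[i], series[i + 1]
        if abs(predicted - actual) <= threshold * actual:
            correct += 1
    return correct / (len(series) - 1)

# e.g., baseline_accuracy(airline_passengers) gives about 0.51
# when airline_passengers holds the 144 monthly values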

Baseline Road marks the southern edge of the University of Colorado at Boulder. Beautiful campus, great spirit at football games. I love the St. Julien Hotel, just north of the school.

Posted in Machine Learning | 1 Comment

Datasets for Binary Classification

The goal of a binary classification problem is to create a machine learning model that makes a prediction in situations where the thing to predict can take one of just two possible values. For example, you might want to predict whether a person is male (0) or female (1) based on predictor variables such as age, income, height, political party affiliation, and so on.

There are many different techniques you can use for a binary classification problem. These techniques include logistic regression, k-NN (if all predictors are numeric), naive Bayes (if all predictors are non-numeric), support vector machines (rarely used any more), decision trees and random forest, and many others. My favorite technique is to use a standard neural network.

If you want to explore binary classification techniques, you need a dataset. You can make your own fake data, but using a standard benchmark dataset is often a better idea because you can compare your results with others.

Here’s a brief description of four of the benchmark datasets I often use for exploring binary classification techniques. These datasets are relatively small and have all, or mostly all, numeric predictor variables, so little or no data encoding is needed.

1. The Cleveland Heart Disease Dataset

There are 303 items (patients); six have a missing value. There are 13 predictor variables (age, sex, cholesterol, etc.). The variable to predict is encoded as 0 to 4, where 0 means no heart disease and 1-4 means presence of heart disease. See Sample:

63.0,1.0,1.0,145.0, . . 6.0,0
67.0,1.0,4.0,160.0, . . 3.0,2
67.0,1.0,4.0,120.0, . . 7.0,1
. . .
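
If it helps, here is a rough sketch of loading the Cleveland data and collapsing the 0-4 label to 0/1 for binary classification. The file name is hypothetical, and I'm assuming missing values are marked with '?' as in the UCI version of the file.

# Hypothetical sketch: load the Cleveland data, drop the six rows with
# missing values, and map labels 1-4 to 1 (heart disease present).
import numpy as np

data = np.genfromtxt('cleveland_data.csv', delimiter=',',
                     missing_values='?', filling_values=np.nan)
data = data[~np.isnan(data).any(axis=1)]   # remove rows with missing values

x = data[:, 0:13]                          # the 13 predictor variables
y = (data[:, 13] > 0).astype(np.int64)     # 0 = no disease, 1 = disease present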

2. The Banknote Authentication Dataset

There are 1372 items (images of banknotes — think Euro or dollar bill). There are 4 predictor variables (variance of image, skewness, kurtosis, entropy). The variable to predict is encoded as 0 (authentic) or 1 (forgery). See Sample:

. . .
. . .

3. The Wisconsin Cancer Dataset

There are 569 items (patients). There is an ID followed by 10 predictor variables (thickness, cell size uniformity, etc.). The variable to predict is encoded as 2 (benign) or 4 (malignant). See Sample:

. . .
. . .

4. Haberman’s Survival Dataset

There are 306 items (patients). There are three predictor variables (age, year of operation, number of nodes). The variable to predict is encoded as 1 (survived) or 2 (died). See Sample:

. . .

Here are some well-known datasets that I don’t like to use:

The Adult dataset to predict if a person makes more than $50,000 per year or not (see ) is popular but it has 48,842 items and eight of the 14 predictor variables are categorical.

The Titanic dataset (did a passenger survive or not – see ) is popular but requires you to sign up with Kaggle and get annoying messages, and the dataset has been pre-split into training and test sets which isn’t always wanted.

The Pima Indians Diabetes (woman has diabetes or not – see ) dataset is popular, but the dataset makes no sense to me because some of the predictor variables have a value of 0 in situations where that is biologically impossible.

Binary star system GG Tauri-A

Posted in Machine Learning | Leave a comment

I Give a Workshop on Introduction to Neural Networks using CNTK v2

I recently gave a one-hour workshop on neural networks using CNTK v2. CNTK is an open source code library that can be used to create neural networks. Getting started with CNTK can be a bit difficult.

For my example, I used the famous Iris Dataset where the goal is to predict the species of an iris flower (setosa, versicolor, virginica) based on four predictor variables (sepal length, sepal width, petal length, petal width). When the goal is to predict a discrete label, as in this problem, you’re doing classification; when the goal is to predict a numeric value, you’re doing regression.
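
To give a flavor of what was presented, here is a minimal sketch of a 4-(5)-3 CNTK v2 network for the Iris data. It is not the exact workshop code; the hidden layer size, learning rate, and variable names are illustrative only.

# Illustrative CNTK v2 sketch for Iris classification (not the workshop code).
import numpy as np
import cntk as C

X = C.input_variable(4, np.float32)   # sepal length, sepal width, petal length, petal width
Y = C.input_variable(3, np.float32)   # one-hot encoded species

with C.layers.default_options(init=C.glorot_uniform()):
    h = C.layers.Dense(5, activation=C.tanh)(X)
    z = C.layers.Dense(3, activation=None)(h)   # raw scores; softmax is applied in the loss

loss = C.cross_entropy_with_softmax(z, Y)
err = C.classification_error(z, Y)
learner = C.sgd(z.parameters, C.learning_rate_schedule(0.01, C.UnitType.minibatch))
trainer = C.Trainer(z, (loss, err), [learner])

# training loop: call trainer.train_minibatch({X: x_batch, Y: y_batch}) repeatedly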

I noticed that even though my example is just about the simplest possible problem, there was still a ton of information to present. By that I mean, when you know some topic well, it’s easy to forget all the things you had to learn to get to that point. One of the reasons I enjoy teaching and training is that by going over the details of some topic, I learn something new or gain a new insight no matter how many times I’ve looked at a topic.

In the workshop I also showed exactly how to install CNTK. In fact, I took a chance and before the workshop, I had removed all my CNTK-related programs and flattened my machine. So I did a live install of CNTK — which was really just begging for trouble. But the installation went smoothly. I showed a live installation because, to me anyway, when learning a new programming language or technology or tool, there’s nothing more frustrating than not being able to get started because of some sort of installation problem.

Posted in CNTK, Machine Learning | Leave a comment

The Kolmogorov-Smirnov Test: A Simple Example

The Kolmogorov-Smirnov (KS) test is a classical statistics technique that can be used to compare a set of observed values with a set of expected values, or compare a set of values with a known distribution.

Suppose you have n = 8 movie ratings where each rating is a number between 1.0 and 5.0 — (1.2, 2.3, 2.4, 2.6, 2.7, 2.9, 3.8, 4.6). It looks like the ratings are low. Is there statistical evidence that the ratings are not evenly (uniformly) distributed?

The key idea of KS is to compare observed with expected, but you compare observed cumulative frequencies with expected cumulative frequencies. I constructed this table:

rating    obs  exp  co  ce   cof   cef 
1.0 - 1.5  1    1    1   1  .125  .125
1.5 - 2.0  0    1    1   2  .125  .250
2.0 - 2.5  2    1    3   3  .375  .375
2.5 - 3.0  3    1    6   4  .750  .500 <- .250
3.0 - 3.5  0    1    6   5  .750  .625
3.5 - 4.0  1    1    7   6  .875  .750
4.0 - 4.5  0    1    7   7  .875  .875
4.5 - 5.0  1    1    8   8  1.00  1.00

First, because there are 8 observation values, I divided the ratings into 8 ranges. The obs column is the observed frequency (number of ratings) in each rating range. The exp is the expected number of ratings in each range if the ratings are evenly distributed — 1 rating in each range. The co is the cumulative count (running total) of observed ratings — because KS works with cumulative frequencies. The ce is the cumulative expected count in each range.

The cof and cef columns are the cumulative observed frequencies and the cumulative expected frequencies — which is just the previous two columns divided by 8.

Now, for KS, you find the largest difference between cumulative observed frequency and cumulative expected frequency. In this example the largest difference is 0.250. Now you look up the so-called critical value of KS for n = 8 from a statistics reference. The critical value, for a 5% significance level for n = 8 is 0.4096. Because the calculated KS statistic of 0.250 is less than the critical value, we conclude there isn't enough evidence to say that the ratings aren't evenly distributed. (Very tricky to phrase.)
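
The table calculation is easy to reproduce in code. Here is a minimal Python sketch of the cumulative-frequency comparison (variable names are mine):

# Compute the KS statistic for the 8 movie ratings against a uniform expectation.
import numpy as np

ratings = np.array([1.2, 2.3, 2.4, 2.6, 2.7, 2.9, 3.8, 4.6])
edges = np.arange(1.0, 5.5, 0.5)              # 1.0, 1.5, ..., 5.0 -> 8 ranges

obs, _ = np.histogram(ratings, bins=edges)    # observed count per range
exp = np.ones(8)                              # expected count per range if uniform

cof = np.cumsum(obs) / len(ratings)           # cumulative observed frequencies
cef = np.cumsum(exp) / len(ratings)           # cumulative expected frequencies

ks_stat = np.max(np.abs(cof - cef))           # 0.250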

If the calculated KS statistic had been greater than 0.4096 we could have concluded that there's evidence (at a 5% significance level) that the movie ratings are not evenly distributed.

There are many details to the KS test, but this blog post should give you a start. In particular, KS is often used to infer if a set of data is Normal (bell-shaped curve) distributed. The tricky part here is calculating the expected frequencies.

In a variation of the KS test, you compare two sets of values to determine if they come from the same distribution. For the example above, if the expected values were placed at the midpoint of each rating range, they'd be: (1.25, 1.75, 2.25, 2.75, 3.25, 3.75, 4.25, 4.75). Using SciPy, I ran a two-sample KS test and got the same result (the 0.928954777402 is the p-value; because it is large, there again isn't enough evidence to say the ratings aren't uniformly distributed).
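
Here is a sketch of that two-sample call using SciPy. The exact numeric output can vary slightly with the SciPy version; the value reported above is the p-value I got.

# Two-sample KS test: observed ratings vs. the range midpoints.
from scipy import stats

observed = [1.2, 2.3, 2.4, 2.6, 2.7, 2.9, 3.8, 4.6]
midpoints = [1.25, 1.75, 2.25, 2.75, 3.25, 3.75, 4.25, 4.75]

stat, p_value = stats.ks_2samp(observed, midpoints)
# a large p-value (about 0.93 here) means there is no evidence
# that the two samples come from different distributions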

The Kolmogorov-Smirnov test is similar in some respects to the chi-square goodness of fit test. However, the chi-square test works directly with observed and expected counts, not cumulative frequencies.

Classical statistics techniques like KS are primitive and almost laughably crude compared to modern machine learning techniques. But classical statistics can still be useful every now and then.

In honor of National Women’s Month I did an Internet search for terms related to goodness of fit and famous women. I got results pointing to Seattle business woman Lou Graham who ran a highly successful gentleman’s club in the early 1900s. See

Posted in Miscellaneous | Leave a comment

Still Yet Another Look at LSTM Time Series Regression

One of my personality weaknesses is that when a technical problem gets stuck in my head, I’m incapable of letting it go. Literally. I can think of several problems that stuck in my brain for several years until I finally solved them.

Time series regression (predicting the next number in a sequence) using an LSTM neural network (a very complex NN that has a memory) is one of these problems. This weekend I made a step forward in fully understanding LSTM time series regression. In particular, I figured out one reason why these problems are so difficult.

My usual example is the Airline Dataset. There are 144 values which represent the total number of international airline passengers each month, from January 1949 through December 1960. The data looks like:

1.12, 1.18, 1.32, 1.29, 1.21, 1.35 . . 4.32

The raw values are x100,000 so the first value means 112,000 passengers. In many of my attempts to predict the next number in the sequence, I thought I saw a phenomenon where all my predictions were off by one month. For example, suppose the input is a set of four values. I saw things like:

Input                    Actual Predicted
1.12  1.18  1.32  1.29    1.21    1.28    
1.18  1.32  1.29  1.21    1.35    1.20
1.32  1.29  1.21  1.35    1.48    1.34

In other words, the model was predicting not the next value, but rather the last value of the current input. I figured I had just made some sort of indexing mistake, because by shifting the predicted values up by one position, the predictions became very accurate. But I was wrong about that.

I now believe this effect is a fundamental problem with LSTM time series regression. An LSTM uses as its input the new input values and also part of the previous output. If the LSTM is ineffective, it could be relying on just the previous output, which, because of the way rolling window data is set up, would give the bad results above. (Note: my explanation here isn’t fully correct — a complete explanation would take a couple of pages.)

Put another way, time series regression problems with an LSTM appear to be extremely prone to a form of over-fitting related to rolling window data.
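
To make the rolling window setup concrete, here is a small sketch of how the training data is typically constructed. The window size of four matches the example above; the code is illustrative, not my exact program.

# Rolling-window data: each input is 4 consecutive values; the target is the next value.
import numpy as np

series = np.array([1.12, 1.18, 1.32, 1.29, 1.21, 1.35])   # first few airline values (x100,000)
window = 4

X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
# X[0] = [1.12 1.18 1.32 1.29], y[0] = 1.21
# note that y[i] is always the last element of X[i+1]; an ineffective LSTM can look
# deceptively good by simply echoing the last value of its input window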

I coded up yet-another-example using the CNTK library. By adding a dropout layer I was able to lessen the effect of this over-fitting. There’s still much more to understand in my search for truth. My current working hypothesis is that LSTMs for time series regression can work well for modeling the structure of a dataset (which can be used for anomaly detection), or for predicting a very short time ahead (perhaps one or two time steps), but not for extrapolating several time steps ahead.

Apollo 11 flight to the moon in 1969 – a brilliant conclusion to a search for scientific truth and an incredible achievement by men who are true heroes.

Posted in CNTK, Machine Learning | 2 Comments

Sentiment Analysis using an LSTM Neural Network

I found an excellent example of building a sentiment analysis prediction model using an LSTM neural network with Keras. The problem is to take the text of several thousand movie reviews from the IMDB Web site that have been marked as either good, bad, or neutral (by the star rating) and create a model that uses review text to predict if the review is good or bad (neutral reviews are thrown out).

The example code I found was on a blog site. I made a few minor changes so I could understand what each part of the code does.

In the abstract, the training data looks like:

w1, w2, w3, . . . wn -> “good”
w1, w2, . . wm -> “bad”

Each w is a word in one or more sentences. An LSTM neural network is designed to handle problems like this where the input is a sequence of related data. The demo code uses the Keras library which is by far the simplest way to implement an LSTM network (at the expense of flexibility).

The demo data is 25,000 reviews marked as good or bad to be used for training, and 25,000 labeled reviews for testing. The demo code achieves 88.40% accuracy on the test items (22,100 correct and 2,900 wrong), which is quite good.
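
The code follows the same general pattern as the standard Keras IMDB LSTM example. Here is a condensed sketch along those lines; the vocabulary size, sequence length, and layer sizes are illustrative and not necessarily the values used by the code the post refers to.

# Condensed sketch in the style of the standard Keras IMDB LSTM example.
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

max_words, max_len = 20000, 80   # vocabulary size and (padded) review length

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_words)
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)

model = Sequential()
model.add(Embedding(max_words, 128))                      # word index -> 128-value vector
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))                 # good (1) or bad (0)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=3, validation_data=(x_test, y_test))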

Even though the demo program is very short, conceptually there’s a lot going on. Very cool example.

The Cinerama Theater in downtown Seattle was built in 1963. In the late 1990s it was in shambles and scheduled for demolition. But very-rich-person Paul Allen did a complete renovation and the theater is now state-of-the-art. Hats off to him!

Posted in Keras, Machine Learning | Leave a comment