A First Look at the CNTK v2.0 Release Candidate Machine Learning Library

Microsoft CNTK (Microsoft Cognitive Toolkit) is a powerful code library that can be used for many machine learning tasks. A few days ago, CNTK v2.0 Release Candidate 1 became available.

Version 2 is a huge change from version 1 — the versions are so different from a developer’s perspective that I consider CNTK v2 to be an entirely different library. CNTK v2 is written in C++ but has a Python API because nobody wants to torture themselves by writing C++ code unless necessary.

So, I rolled up my developer’s sleeves and dove in. Because I had an old CNTK v2 Beta, I first removed it by 1.) Using the Control Panel to uninstall the Anaconda Python distribution, 2.) Deleting all references to CNTK and repos from my System Environment Variables, and 3.) Deleting the old install directory (C:\local).

With a clean system, I first installed the required Anaconda version 4.1.1 64-bit with Python 3. After verifying Python 3.5 was installed, I used pip to install CNTK by opening a command shell and typing (the URL is really long so I put a space after each slash for readability):

pip install https://cntk.ai/ PythonWheel/ CPU-Only/ 

And the installation just worked. Nice! I verified CNTK was alive by typing:

python -c "import cntk; print(cntk.__version__)"

and CNTK responded by displaying its 2.0rc1 version.

Next I took a CNTK script that I’d written for CNTK v2 Beta and tried to run it. I immediately got lots of errors, but they weren’t too hard to fix — mostly package name changes.

My script creates a simple, single-hidden-layer neural network and builds a model that can make predictions on the famous Iris Dataset.
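Setting the CNTK specifics aside, the underlying architecture is easy to sketch in plain numpy. The layer sizes and weight values below are illustrative placeholders, not the trained values from my script:

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 inputs (sepal/petal measurements), 5 hidden nodes, 3 outputs (species)
W1 = rng.normal(0, 0.1, (4, 5)); b1 = np.zeros(5)
W2 = rng.normal(0, 0.1, (5, 3)); b2 = np.zeros(3)

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numeric stability
    return e / e.sum()

def forward(x):
    h = np.tanh(x @ W1 + b1)      # hidden layer with tanh activation
    return softmax(h @ W2 + b2)   # output layer -> three probabilities

x = np.array([5.1, 3.5, 1.4, 0.2])  # one Iris-like input item
probs = forward(x)
print(probs)  # three probabilities that sum to 1.0
```

Training (computing W1, b1, W2, b2 via back-propagation) is where CNTK does the heavy lifting; this sketch only shows the forward pass.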

CNTK is a very powerful library, but it has a double learning curve, both parts steep. Because it works at a relatively low level, you must have a good grasp of things like neural network architecture and concepts such as back-propagation, and you must have intermediate or better Python skills. And then learning the library itself is quite difficult.

But the payoff is a very powerful, very fast machine learning library.

Posted in Machine Learning | 1 Comment

A Recap of Science Fiction Movies of 2016

Now that 2016 is well over, I’ve had a chance to see most of the main science fiction films from 2016. It was an uneven year with a few decent films but some real clunkers too. Here are my top ten science fiction films from 2016.

1. Dr. Strange – Most people don’t classify this as a science fiction film, but for me the introduction of the multiverse makes it so. And besides, this is my blog so I can do what I want. This movie was a pleasant surprise – good effects, good acting, good pace, decent plot. I give Dr. Strange a solid A- grade.

2. Rogue One: A Star Wars Story – Another pleasant surprise, mostly because of the absolutely awful “Star Wars: The Force Awakens” from 2015. I think Rogue One is the second best of all the Star Wars films, ranking behind only Star Wars Episode IV – A New Hope (the first one, released in 1977). I grade Rogue One at a B+.

3. Star Trek Beyond – A third nice surprise. This is just a good, plain old action film with a nice balance of plot, effects, and character development. The crew of the Enterprise is lured into a trap on Altamid. No academy award winner, but a good film and I give it a B+ grade.

4. Arrival – I had high hopes for this aliens-come-to-earth film, and it’s pretty good, but I somehow get the feeling it could have been a bit better. I liked the realistically portrayed effort to communicate with the aliens, but the film was a bit too slow and disjointed. I also liked the twist ending. Grade = B.

5. Spectral – I liked this film much more than my friends did. Soldiers fight ghost-like creatures that turn out to be 3D printed semi-humans. Clever idea and good action sequences. Grade = B-.

6. 10 Cloverfield Lane – John Goodman kidnaps a couple of young people to protect them in his bomb shelter from an alien invasion. Or is he just crazy? Good premise but this film is really slow. Grade = C+.

7. Passengers – The biggest disappointment of 2016 for me. One of those films where the whole is less than the sum of the parts. Some films just don’t quite work. This is one. Jennifer Lawrence and Chris Pratt are alone on a spaceship because they woke up too early from their 120-year trip. Not bad by any means but could have been much better. Grade = C.

8. Independence Day: Resurgence – Groan. Just a bad film, which on the one hand is surprising because the original Independence Day was so good, but this film has sequel-itis. The aliens come back but things like plot continuity and story didn’t. More special effects aren’t always a good thing, and this film is an example. Grade = D+.

9. The 5th Wave – I foolishly had high hopes for this film. I don’t even know how to describe it — take the most annoying parts of any recent teen-based sci-fi film, mix them up, and you get The 5th Wave. When this film was over I was glad to give it a wave goodbye. Grade = D.

10. The Divergent Series: Allegiant – Terrible film. Annoying film. Why did I torture myself? Ugh. Grade = F+.

Posted in Top Ten

Composable Recurrent Neural Networks

A basic neural network (NN) has no memory of previous inputs or outputs. This means an NN has great trouble predicting the next token in a sequence. For example, suppose you want to predict the next word in the sentence, “I like pasta so tonight I’ll eat (blank).” A reasonable prediction would be “spaghetti”, but a basic neural network sees only one word at a time and probably wouldn’t do well on this prediction problem.

In the 1990s a special type of NN called a recurrent neural network (RNN) was devised. Each input is combined with the output of the previous iteration. For example, when presented with the word “so”, an RNN will remember the output from the previous input, “pasta”. In this way each input carries a trace of memory of previous outputs.

To implement a simple RNN isn’t too difficult conceptually, but it’s quite a chore in practice. One engineering strategy is to create a composable module that can be chained together. Expressed as a diagram:

Here Xt is the current input (typically a word or a letter). Yt is the output (a vector of probabilities that represent the likelihood of each possible next word). The box labeled tanh is a layer of hidden processing nodes and a layer of output nodes — essentially a mini neural network. But notice that just before the Xt input reaches the tanh mini-network, the output from the previous item is concatenated to the input.

Once such a module has been implemented, you can chain them together like this:

This is pretty cool. However, in practice, these very simple RNNs just don’t perform well. The main problem is that they just can’t remember enough. This gave rise to more sophisticated forms of RNNs, in particular the oddly-named but very effective “long, short-term memory” network (LSTM) network.

So why even bother with the simple RNNs? Because in order to implement LSTMs and even more exotic networks, understanding basic RNNs is a good way to start. So that’s what I’m doing.

Posted in Machine Learning

Kernel Perceptrons using C#

I wrote an article titled “Kernel Perceptrons using C#” in the April 2017 issue of MSDN Magazine. See https://msdn.microsoft.com/en-us/magazine/mt797653.

A kernel perceptron is a machine learning technique that can be used to make a binary prediction — one where the thing-to-be-predicted can take on just one of two values. For example, you might want to predict if a person is Male (-1) or Female (+1) based on Age, Income, and Education.

Ordinary perceptrons are really just a curiosity because they can only predict in situations where you have what’s called linearly separable data (you can draw a straight line to separate). But by applying the “kernel trick” you can create perceptrons that can handle more complex data like this:

The kernel trick is based on a so-called kernel function. There are many such functions, but the most common is called the radial basis function (RBF) kernel. RBF is a measure of similarity between two numeric vectors where an RBF(v1, v2) = 1.0 indicates the vectors are equal, and smaller values, approaching 0.0, indicate more different.

Briefly, to make a binary prediction, a kernel perceptron computes the RBF similarity between the item to be predicted and all training items (data with known input values and known, correct classification values) and aggregates those RBF similarity values to make a prediction.

Sadly, kernel perceptrons are now just curiosities because there are more powerful techniques, notably binary neural network classifiers, and kernel logistic regression. But kernel perceptrons are a good introduction to the math and ideas of kernel methods in general.

Posted in Machine Learning

Bingo Pinball Machines

I love old electro-mechanical devices. Years ago my college roommate Ed and I enjoyed playing “bingo pinball machines” that were manufactured in the 1950s. The machine’s top screen has a 5 by 5 bingo-like grid with the numbers 1-25 in a scrambled order. You shoot five metal balls, one at a time, and try to get three or four or five numbers in a row. If you did, the machine paid out nickels like a slot machine.

These old devices were complex mechanically and many were really beautiful works of mechanical art. I discovered a fantastic simulation Web site (link below) and was able to relive some of the fun Ed and I had with these machines. Here’s an example session playing “Beach Time”, a typical bingo pinball game.

1. To get started, I put in a bunch of (virtual) nickels. After each nickel, zero, one, or more features would light up. After about 30 nickels I reached the image below (you can click on it to enlarge). The two key items are the payout tables at the bottom and the lighted letters, A through F. If I get 3, 4, or 5 in a row on a red line, my payoffs are 450, 240, 120 respectively. My payouts for yellow (200, 96, 32) and green (300, 144, 64) lines are a bit less.

2. I shot my first four balls and they landed in numbers 7, 13, 20, and 12 as shown below in the image on the left. Now comes the cool part. Notice that “Press Buttons Before Shooting 5th Ball” feature is lighted. I have the option of pressing A B, C, D, or F because they’re all lighted. The D grid has the numbers

16  13
 5  21

I pressed the D button three times. After each press the four numbers physically rotate one step clockwise so I positioned the 13 under the 7, giving me two chances to get three in a row (a 2 or a 16). Then I pressed the E button two times to position the lighted 12 under the 13, giving me a chance to get four in a row (if I got a 16). See the image below, on the right. Note that I didn’t press the F button but I should have because the 20 in the F section cannot be part of any three-in-a-row. I should have pressed F once to move the 20 down one position, giving me a chance for 12-8-20 or 13-5-20.

3. I shot my fifth and final ball and my faulty strategy worked because I got a 2, completing a 2-7-13 three-in-a-row on a green line. I won 64 virtual nickels, as shown in the counter in the upper left of the image below. Good fun!!

There is some fascinating combinatorial mathematics going on here. The positions of the 25 numbers on the display screen, combined with the positioning of the 25 holes on the playing surface are critically important with regards to the probabilities of payouts.

I’ve just touched on a tiny bit of these fascinating machines. If you want to learn more about bingo pinball machines, I suggest starting with:


You can download the excellent simulation program from:


Posted in Miscellaneous

The Difference Between Log Loss and Cross Entropy Error

If you’re new to neural networks, you’ll see the terms “log loss” and “cross entropy error” used a lot. Both terms mean the same thing. Multiple, different terms for the same thing is unfortunately quite common in machined learning (ML). For example, “predictor variable”, “feature”, “X”, and “independent variable” all have roughly the same meaning in ML.

For the rest of this post, I’ll call the idea I’m explaining cross entropy (CE). There are two scenarios. First, a general case, which is more useful to mathematicians. And second, a case specific to ML classification.

In the first, general scenario, CE is used to compare a set of predicted probabilities with a set of actual probabilities. For example, suppose you have a weirdly-shaped, four-sided dice (yes, I know the singular is “die”). Using some sort of physics or intuition you predict that the probabilities for the weird dice are (0.20, 0.40, 0.10, 0.30). Then you toss the dice many thousands of times and determine that the true probabilities are (0.15, 0.35, 0.15, 0.35):

predicted: (0.20, 0.40, 0.10, 0.30)
actual:    (0.15, 0.35, 0.15, 0.35)

Cross entropy can be used to give a metric of your prediction error. CE is minus the sum of the log of predicted, times actual. If p is predicted probability and a is actual probability, then

So your CE is:

-( ln(0.20)*0.15 + ln(0.40)(0.35) +
   ln(0.10)(0.15) + ln(0.30)(0.35) )

= 1.33

Somewhat unusually, the CE for a prefect prediction is not 0 as you’d expect. For example, if your four predictions are (0.25, 0.25, 0.25, 0.25) and the four actuals are also (0.25, 0.25, 0.25, 0.25) then the CE is 1.39 (this CE is, not un-coincidentally the ln(4)).

Now in the case of ML classification, the predicted probabilities are values that sum to 1.0 but the “actual” probabilities all have the form of one 1.0 value and the rest 0.0 values. For example, suppose you are trying to predict the political party affiliation of a person and there are four possible values: democrat, republican, libertarian, other. These values would be 1-of-N encoded as democrat = (1,0,0,0), republican = (0,1,0,0), libertarian = (0,0,1,0), other = (0,0,0,1).

And suppose a neural network classifier emits a prediction of (0.50, 0.10, 0.10, 0.30) when the actual party is democrat. The cross entropy error for the prediction is:

-( ln(0.50) * 1 + ln(0.10) * 0  +
   ln(0.10) * 0 + ln(0.30) * 0 )

= 0.70

Notice that because of the 1-of-N encoding, there’s a lot of multiply-by-zero so all the terms in CE drop out except for one.

For this type of problem scenario, a perfect prediction does nicely give a cross-entropy error of 0 because ln(1.0) = 0.

As a final note, when coding cross entropy error, you have to be careful not to try and compute the ln(0.0) which is negative infinity.

Posted in Machine Learning

Conferences and Las Vegas

Many of the software development conferences I go to are in Las Vegas. More than any other city in the world, Vegas is set up to handle conferences of all types and sizes.

There are several Web sites that list upcoming conferences in Vegas. Before I go to Vegas to speak at an event, I often check one of these Web sites to see what other events will be happening at the same time.

Here are a few interesting events from my last scan of Vegas conferences.

1. Tortilla Industry Association Convention – This one is rather typical. If somebody makes or sell something, there’s usually an Association and they’ll have a convention.

2. Jobs for America’s Graduates Inc. Nevada JAG Trainig Seminar – I hope the “trainig” includes spelling.

3. SuperZoo West 2017 – Huh? Whatever it is, they’re expecting 20,000 people.

4. Pain Week 2017 – I think I’ll pass on this one.

5. National Association of Criminal Defense Lawyers DUI Meeting – Will attendees be driving to this event?

6. Breast Imaging A-Z – In Las Vegas, this could mean just about anything. . .

7. Automotive Aftermarket Week – Also known as SEMA, this event will have 160,000 attendees. I’ve been in Vegas during a SEMA and it’s every bit as chaotic as you’d guess.

8. Airlines for America Slot Exchange Meeting – Do passengers in Coach Class get slot machines or only First Class passengers?

9. World of Concrete – I think I might prefer the partner event, World of Abstract.

Las Vegas conferences – something for everyone.

Posted in Conferences, Top Ten