Running the ML.NET Quick Start Tutorial

Before I write anything else, let me say Bravo! At last somebody (or some group of people) created a quick start for a new technology, and the quick start is perfect: straight to the point, and it works the first time.

OK, so what exactly impressed me? I took a look at the brand new ML.NET, which is a machine learning code library for software developers who use .NET technologies. The ML.NET library is based on an internal Microsoft library named TLC. TLC has been around for years (the current version inside Microsoft is 3.9), and TLC code is used in many existing Microsoft products and services.

I decided I’d take a look at the early ML.NET documentation. Most documentation is horrible, so I was mentally prepared for a bad experience, but as I mentioned, the documentation was excellent.

The quick start uses the new .NET Core, a software ecosystem similar to the .NET Framework that is well suited to console (shell) applications, including ML.NET applications. First, I downloaded and installed .NET Core onto my machine.

Next I opened a command shell and created a new console application.

Next, I created the data file.

Next, I copy-pasted the ML.NET C# program.

And last, I ran the program.

OK, so there’s a lot I don’t understand yet, but the point of a quick start is to just get started. The rest is relatively easy.

The start of the 1966 Le Mans race where drivers sprint to their cars. Ford GT40s took first, second, and third places, ending five years of Ferrari wins. The golden age of motor sports.

Posted in Machine Learning

Accuracy, Precision, Recall, and F1 Score

If you have a binary classification problem, four fundamental metrics are accuracy, precision, recall, and F1 score. They’re best explained by example. Suppose the problem is to predict if a sports team will win or lose. There are four possible scenarios:

1. you predict the team will win and they do ("true positive")
2. you predict the team will win but they don't ("false positive")
3. you predict the team will lose and they do ("true negative")
4. you predict the team will lose but they don't ("false negative")

Suppose you make 100 predictions for different games and your results are:

TP = 40 (correctly predicted a win)
FP = 20 (incorrectly predicted a win)
TN = 30 (correctly predicted a loss)
FN = 10 (incorrectly predicted a loss)

The four metrics are:

1. accuracy = num correct / (num correct + num wrong)
            = (TP + TN) / (TP + FP + TN + FN)
            = 70 / 100
            = 0.70

2. precision = TP / (TP + FP)
             = 40 / (40 + 20)
             = 40 / 60
             = 0.67

3. recall = TP / (TP + FN)
          = 40 / (40 + 10)
          = 40 / 50
          = 0.80

4. F1 score = 1 / [ ((1 / 0.67) + (1 / 0.80)) / 2 ]
            = 1 / [ (1.50 + 1.25) / 2 ]
            = 1 / (2.75 / 2)
            = 1 / 1.375
            = 0.73

Accuracy is intuitive and, in my opinion, the single most important metric. Precision and recall are very difficult for me to interpret intuitively, so I just think of them as metrics where higher values are better. In practice there is usually a trade-off between the two: tuning a model to increase precision tends to decrease recall, and vice versa. The F1 score is the harmonic average of precision and recall, the idea being that it gives you a single combined metric. Therefore, for F1 scores, larger values are better. Notice that the F1 score of 0.73 is between the precision (0.67) and recall (0.80). You could use a regular average instead of a harmonic average, but because precision and recall are both proportions, a harmonic average is more principled.
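The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch. The function name classification_metrics is my own, and the code assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall,
    # algebraically the same as 1 / [((1/p) + (1/r)) / 2].
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts from the win/lose example: 100 predictions in total.
acc, prec, rec, f1 = classification_metrics(tp=40, fp=20, tn=30, fn=10)
print("accuracy  = %0.2f" % acc)   # 0.70
print("precision = %0.2f" % prec)  # 0.67
print("recall    = %0.2f" % rec)   # 0.80
print("F1 score  = %0.2f" % f1)    # 0.73
```

Using the closed form 2pr / (p + r) avoids the intermediate rounding in the hand calculation but gives the same two-decimal result.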

The movie “Total Recall” (1990) starring Arnold Schwarzenegger and Sharon Stone, had fantastic special effects for the time in which the movie was made. But the plot had me very confused — I never really knew exactly who was good and who was bad, even at the end of the movie. I don’t like ambiguous movie endings. The remake in 2012 was just plain bad, bad, bad.

Posted in Machine Learning

The Keras MNIST Example using Model Instead of Sequential

Just for fun, I decided to code up the classic MNIST image recognition example using Keras. I started by doing an Internet search. I found the EXACT same code repeated over and over by multiple people. The original code comes from the Keras documentation. I was stunned that nobody made even the slightest effort to add something new.

So, I figured I’d refactor the code to use the Model() approach rather than the Sequential() approach. The Sequential() approach creates a model like this:

model = Sequential()
model.add(Conv2D(32, (3,3), input_shape=(28,28,1)))
etc, etc, for 8 layers

The Model() approach looks like:

X = Input(shape=(28,28,1))
layer1 = Conv2D(32, (3,3))(X)
layer2 = Conv2D(64, (3,3))(layer1)
etc, etc
model = Model(X, layer8)

The exercise allowed me to get insights into exactly how CNN image classification works using Keras. Like everyone I know, I learn by starting with some working code, often from documentation, but then the key is to experiment with the code.


from __future__ import print_function
import keras
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

bat_size = 128
epochs = 3

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 28, 28, 1)
x_test = x_test.reshape(10000, 28, 28, 1)

x_train = x_train.astype(np.float32)
x_test = x_test.astype(np.float32)
x_train /= 255
x_test /= 255

y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

# model = Sequential()
# model.add(Conv2D(32, kernel_size=(3, 3),
#                  activation='relu',
#                  input_shape=input_shape))
# model.add(Conv2D(64, (3, 3), activation='relu'))
# model.add(MaxPooling2D(pool_size=(2, 2)))
# model.add(Dropout(0.25))
# model.add(Flatten())
# model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.5))
# model.add(Dense(num_classes, activation='softmax'))

X = keras.layers.Input(shape=(28,28,1))
layer1 = Conv2D(filters=32, kernel_size=(3, 3),
  activation='relu', padding='valid')(X)
layer2 = Conv2D(filters=64, kernel_size=(3, 3),
  activation='relu')(layer1)
layer3 = MaxPooling2D(pool_size=(2, 2))(layer2)
layer4 = Dropout(0.25)(layer3)
layer5 = Flatten()(layer4)
layer6 = Dense(128, activation='relu')(layer5)
layer7 = Dropout(0.5)(layer6)
layer8 = Dense(10, activation='softmax')(layer7)

model = keras.models.Model(X, layer8)

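# NOTE: the compile() and fit() calls were truncated in this listing.
# The loss and optimizer below follow the Keras documentation MNIST
# example and are an assumption, not necessarily the original settings.
model.compile(loss='categorical_crossentropy',
  optimizer='adadelta',
  metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=bat_size,
  epochs=epochs, verbose=1)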

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
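
One way to get a feel for what the eight layers do is to trace the tensor shapes and weight counts by hand. The sketch below does that in plain Python, without Keras. The helper names are my own, and it assumes every convolution uses stride 1 with 'valid' padding (the Keras default), which is what layer1 specifies.

```python
def conv2d_valid(shape, filters, k):
    """Output shape and weight count of a k x k Conv2D, stride 1, 'valid' padding."""
    h, w, c = shape
    params = (k * k * c + 1) * filters  # +1 is the bias of each filter
    return (h - k + 1, w - k + 1, filters), params


def max_pool(shape, p):
    """Output shape of p x p max pooling (no trainable weights)."""
    h, w, c = shape
    return (h // p, w // p, c)


def dense(n_in, n_out):
    """Weight count of a fully connected layer, including biases."""
    return (n_in + 1) * n_out


shape = (28, 28, 1)                        # X: one grayscale MNIST image
shape, p1 = conv2d_valid(shape, 32, 3)     # layer1 -> (26, 26, 32)
shape, p2 = conv2d_valid(shape, 64, 3)     # layer2 -> (24, 24, 64)
shape = max_pool(shape, 2)                 # layer3 -> (12, 12, 64)
flat = shape[0] * shape[1] * shape[2]      # layer5 (Dropout does not change shape)
p6 = dense(flat, 128)                      # layer6
p8 = dense(128, 10)                        # layer8
total = p1 + p2 + p6 + p8

print("flattened size:", flat)
print("total trainable weights:", total)
```

Almost all of the weights sit in the first Dense layer, because Flatten turns the 12 x 12 x 64 feature maps into one long vector.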

Hogwarts school made entirely from matchsticks.

Posted in Keras, Machine Learning

Introduction to DNN Image Classification Using CNTK

I wrote an article titled “Introduction to DNN Image Classification Using CNTK” in the July 2018 issue of Microsoft MSDN Magazine.

Image classification is a standard problem in machine learning. If you have an image (typically a photograph), the goal is to classify it as, for example, “dog”, “cat”, or “squirrel”.

The standard way to perform image classification is to use an exotic type of neural network called a convolutional neural network (CNN). But until just a few years ago, the most common approach was to use a standard deep neural network (DNN). In my article I showed how to use the older DNN approach.

For my example, I used the well-known MNIST (Modified National Institute of Standards and Technology) dataset. It consists of 60,000 training images and 10,000 test images. Each image is a picture of a handwritten digit, ‘0’ through ‘9’. Each image is small: 28×28 grayscale pixels (each pixel value is 0 to 255). So the goal is to accept a 28×28 image and classify it as a “zero” or a “one” or a “two”, etc.
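
To make the data format concrete, here is a small sketch of how one image could be prepared for a standard DNN. It assumes the network takes a flat vector of 28 × 28 = 784 values scaled to the range 0 to 1 and a one-hot encoded label. That is a common setup, not necessarily the exact one used in the article. The image is random dummy data and the function names are my own.

```python
import random

random.seed(0)

# Dummy stand-in for one MNIST image: 28 rows of 28 pixel values in 0..255.
image = [[random.randint(0, 255) for _ in range(28)] for _ in range(28)]
label = 7  # hypothetical class label for this dummy image


def to_dnn_input(img):
    """Flatten a 2D image row by row and scale each pixel to [0.0, 1.0]."""
    return [pixel / 255.0 for row in img for pixel in row]


def one_hot(digit, num_classes=10):
    """Encode a digit 0..9 as a vector with a single 1.0."""
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]


x = to_dnn_input(image)
y = one_hot(label)
print(len(x))  # 784 input values
print(y)
```

Flattening throws away the 2D layout of the pixels, which is one reason a CNN usually does better on image problems than a plain DNN.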

Instead of creating a custom image classification system using a DNN or a CNN, you can use pre-trained image models from companies such as Google or Microsoft.

Using a generic image classification model may not give you optimal results. Google found this out when their image classification system was classifying some people as gorillas. According to Wired Magazine, Google’s image search function now refuses to return any results for “gorilla” or “monkey”.

Posted in CNTK, Machine Learning

Why a Neural Network is Always Better than Logistic Regression

Logistic regression is a technique that can be used for binary classification — making a prediction when the thing to predict can be one of just two possible values. For example, you might want to predict if a person is male (0) or female (1) based on age, annual income, height, weight, and so on.

A neural network is more complex than logistic regression. And, as I show in the diagram below, logistic regression is a special case of a neural network classifier. To cut to the chase, you can simulate a logistic regression model using a neural network with one hidden node with the identity activation function, and one output node with a hidden-to-output weight of 1.0, zero bias, and logistic sigmoid activation.

The moral of the story is that, in principle, anything you can do with logistic regression you can do with a neural network. Therefore, theoretically, a neural network is always better than logistic regression, or more precisely, a neural network can do no worse than logistic regression.

In the diagram, there are three input values (1.0, 2.0, 3.0). The logistic regression model on the left emits output value 0.5474 and so does the neural network model on the right. For the male-female example, the prediction would be female because the output value is greater than 0.5 (if the value was less than 0.5 the prediction would be male).
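
The equivalence is easy to check numerically. The sketch below uses the three input values from the diagram, but the weights and bias are hypothetical values I picked for illustration, since the diagram's actual weights are not listed in the text. The function names are my own.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_regression(x, w, b):
    """Plain logistic regression: sigmoid of the weighted sum plus bias."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


def one_hidden_node_network(x, w, b):
    """Neural network with one identity hidden node and one sigmoid output node."""
    hidden = sum(wi * xi for wi, xi in zip(w, x)) + b  # identity activation
    output_weight, output_bias = 1.0, 0.0              # pass the hidden value through
    return sigmoid(output_weight * hidden + output_bias)


x = [1.0, 2.0, 3.0]      # inputs from the diagram
w = [0.01, 0.02, 0.03]   # hypothetical weights
b = 0.05                 # hypothetical bias

lr_out = logistic_regression(x, w, b)
nn_out = one_hidden_node_network(x, w, b)
print("logistic regression: %0.4f" % lr_out)
print("neural network:      %0.4f" % nn_out)
```

Because the hidden node just passes the weighted sum through unchanged, the two functions compute the same value for any weights, not only the ones used here.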

Now all of this is “in theory” and “in principle”. In practice, a neural network model for binary classification can be worse than a logistic regression model because neural networks are more difficult to train and are more prone to overfitting than logistic regression. That said, the bottom line is that when doing binary classification, using a neural network is better in most cases than using logistic regression. And if you’re careful, you should be able to get better results with a neural network.

Posted in Machine Learning

The bAbI Dataset

The bAbI (pronounced “baby”) dataset is a collection of tasks intended for use by researchers who work with natural language processing, in particular “QA” which means question-and-answer (not “quality assurance”).

Here’s an example of bAbI called a “single supporting fact” task:

1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary? 	bathroom	1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel? 	hallway	4
7 John moved to the office.
8 Sandra journeyed to the bathroom.
9 Where is Daniel? 	hallway	4
10 Mary moved to the hallway.
11 Daniel travelled to the office.
12 Where is Daniel? 	office	11
13 John went back to the garden.
14 John moved to the bedroom.
15 Where is Sandra? 	bathroom	8

The number after the answer to a question is the number of the statement that’s needed to answer the question.
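
The format is simple enough to parse with a few lines of Python. This is a minimal sketch, assuming that a question line holds the question, the answer, and the supporting statement numbers separated by tab characters, as in the excerpt above. The function name parse_babi is my own, and the sketch does not handle files where the line numbering restarts for a new story.

```python
def parse_babi(lines):
    """Split bAbI-style lines into statements and questions.

    Returns (statements, questions): statements maps a line number to its
    text, and questions is a list of dicts with the question text, the
    answer, and the numbers of the supporting statements.
    """
    statements = {}
    questions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        num, text = line.split(" ", 1)
        num = int(num)
        if "\t" in text:  # question lines carry tab-separated fields
            question, answer, support = text.split("\t")
            questions.append({
                "id": num,
                "question": question.strip(),
                "answer": answer,
                "supporting": [int(s) for s in support.split()],
            })
        else:
            statements[num] = text
    return statements, questions


sample = [
    "1 Mary moved to the bathroom.",
    "2 John went to the hallway.",
    "3 Where is Mary? \tbathroom\t1",
]
statements, questions = parse_babi(sample)
for q in questions:
    facts = [statements[i] for i in q["supporting"]]
    print(q["question"], "->", q["answer"], "because", facts)
```

The supporting field is split on whitespace, so a line with several supporting statement numbers is handled the same way as a line with one.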

Here’s an example of a “counting” task:

1 Mary moved to the bathroom.
2 Sandra journeyed to the bedroom.
3 John went to the kitchen.
4 Mary took the football there.
5 How many objects is Mary carrying? 	one	4
6 Sandra went back to the office.
7 Daniel went back to the office.
8 How many objects is Mary carrying? 	one	4
9 John moved to the bedroom.
10 Sandra moved to the garden.
11 How many objects is Mary carrying? 	one	4
12 Mary travelled to the garden.
13 Mary went to the hallway.
14 Sandra journeyed to the bedroom.
15 Mary dropped the football.
16 How many objects is Mary carrying? 	none	4 15
17 Mary got the football there.
18 Daniel travelled to the garden.
19 How many objects is Mary carrying? 	one	4 15 17

So the idea is to create models that can answer the questions and give an explanation.

As the name “bAbI” suggests, these are intended to be simple, not entirely realistic problems. The idea is that if researchers use a common set of tasks like bAbI, they’ll be able to compare results more easily.

By the way, I sent an email message to the authors of bAbI, asking them about the origin of the name “bAbI”. Antoine Bordes replied quickly and courteously — “bAbI” is not an acronym.


Baby, “Sucker Punch” (2011). Baby, “Baby Driver” (2017). Baby, “Dirty Dancing” (1987). Of these three movies, I liked “Baby Driver” the best.

Posted in Machine Learning

Machine Learning with IoT Devices on the Edge

I wrote an article titled “Machine Learning with IoT Devices on the Edge” in the July 2018 issue of Microsoft MSDN Magazine.

There are no standard definitions for any of the terms in my title. Machine learning is some software system that makes a prediction. An IoT device is, well, basically any kind of hardware device but usually something small with limited power and memory. The Edge is anything that’s connected to the Cloud but not actually part of the Cloud.

In my article I used an example of a small device similar to a Raspberry Pi. You want to run ML software on the device. In most cases you can’t install a lot of heavyweight software on an IoT device so you have to somehow get the ML model (the part that just does the prediction, as opposed to the part that learns/creates the model) onto the device.

I describe two approaches. The first is to write custom C++ code for the device. This works reasonably well for a simple ML model such as a single-hidden-layer neural network. But it doesn’t work well for large complex models.
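
To show how little code the prediction part of a simple model needs, here is a sketch of the forward pass of a single-hidden-layer network. On a real device this would be custom C++ as described above. I use plain Python with no libraries here only to match the other code on this blog. The tanh and softmax activations, the 3-2-2 layer sizes, and all weight values are assumptions for illustration.

```python
import math


def forward(x, w_ih, b_h, w_ho, b_o):
    """Compute output probabilities for one input vector.

    w_ih[i][j] is the weight from input i to hidden node j;
    w_ho[j][k] is the weight from hidden node j to output node k.
    """
    hidden = [
        math.tanh(sum(x[i] * w_ih[i][j] for i in range(len(x))) + b_h[j])
        for j in range(len(b_h))
    ]
    scores = [
        sum(hidden[j] * w_ho[j][k] for j in range(len(hidden))) + b_o[k]
        for k in range(len(b_o))
    ]
    # Softmax, shifted by the max score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical trained weights for a 3-input, 2-hidden, 2-output network.
w_ih = [[0.10, -0.20], [0.30, 0.40], [-0.50, 0.60]]
b_h = [0.01, -0.02]
w_ho = [[0.70, -0.80], [0.90, 1.00]]
b_o = [0.03, -0.04]

probs = forward([1.0, 2.0, 3.0], w_ih, b_h, w_ho, b_o)
print(probs)
```

Only the weights and this one function have to live on the device. The training code that produced the weights can stay on a bigger machine.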

The second approach is a look ahead to a system under development at Microsoft called ELL (Embedded Learning Library). It’s kind of like a cross-compiler for ML models that does lots of clever optimization to shrink a standard ML model down to a size feasible for a tiny IoT device.

The moral of the story is that all of these efforts to get AI/ML intelligence into IoT devices are very early in development. But there are a lot of companies working very hard and things will happen very quickly over the next couple of years.

Posted in Machine Learning