Decision Tree Classification

I’m not a big fan of decision tree classification. Although decision trees have a couple of advantages over neural network classifiers (simplicity, and a degree of interpretability), decision trees rarely work as well as neural networks, at least on the types of problems I deal with.

Here’s an example of a decision tree for the well-known Iris Dataset, using the scikit-learn library.


import numpy as np 
from sklearn import datasets 
from sklearn import tree 

def show_iris_tree(model):
  # display the tree as decision rules; adapted from
  # tree-interpretation-decision-rules-python.html
  left = model.tree_.children_left    # left child of each node
  right = model.tree_.children_right  # right child of each node
  thresh = model.tree_.threshold      # split threshold at each node
  names = ['sepal-len', 'sepal-wid', 'petal-len', 'petal-wid']
  features = [names[i] for i in model.tree_.feature]  # feature tested at each node
  value = model.tree_.value           # class counts at each node

  def process(left, right, thresh, features, node, depth=0):
    indent = "  " * depth
    if thresh[node] != -2:  # interior node
      print(indent, end="")
      print("if ( %s <= %0.4f ) {" % (features[node], thresh[node]))
      if left[node] != -1:
        process(left, right, thresh, features, left[node], depth+1)
      print(indent + "} else {")
      if right[node] != -1:
        process(left, right, thresh, features, right[node], depth+1)
      print(indent + "}")
    else:                   # leaf node
      print(indent + "return " + str(value[node]))

  process(left, right, thresh, features, 0)

# ==============================================================

print("\nBegin Iris decision tree example \n")

print("Loading iris data into memory \n")
iris = datasets.load_iris() 
X = iris.data    # 150x4 predictor values
y = iris.target  # 150 species labels (0, 1, 2)

print("Creating decision tree max_depth=3")
tr = tree.DecisionTreeClassifier(max_depth=3)
tr.fit(X, y)

show_iris_tree(tr)

print("\nEnd decision tree demo ")

So, no real moral to the story, just a little investigation during my lunch break.

“Olive Trees with Yellow Sky and Sun” (1889), Vincent Van Gogh

Posted in Machine Learning | Leave a comment

Why You Shouldn’t Apply Softmax Twice to a Neural Network

I was presenting a talk about the Microsoft CNTK neural network library recently. CNTK has a quirk that isn’t explained anywhere in the documentation. Briefly, unless you’re careful, when you train a CNTK neural network classifier, you will apply softmax activation twice. Your training will still work but training will be much slower than it should be, which means that your resulting model may not be as accurate as it should be.

Consider this (incorrect) CNTK code snippet:

h_layer = C.layers.Dense(hidden_dim,
  activation=C.ops.tanh, name='hidden_layer')(X)  
o_layer = C.layers.Dense(output_dim,
  activation= C.ops.softmax, name='out_layer')(h_layer)
model = o_layer
. . .
tr_loss = C.cross_entropy_with_softmax(model, Y)
. . .
(train model)
. . . 
(use model to make predictions)

In most neural libraries, you apply softmax activation to the output layer to coerce the raw output values so that they sum to 1.0 and can be interpreted as probabilities. But CNTK doesn’t have a basic cross entropy loss function; it only has a cross entropy with softmax loss function. This means that in the snippet above softmax is applied twice.

Why is this a problem? In the image below, imagine you have a neural network classifier where one training item has correct output of (0, 1, 0). The computed output values are (1.5, 3.5, 2.5) and if softmax is applied the output values become (0.0900, 0.6652, 0.2447). If you stopped here and used regular cross entropy loss, you have nice separation between the output values.

But, if you apply softmax a second time, the output values become (0.2535, 0.4506, 0.2959). The values still sum to 1.0 and the middle value is still largest, but the three output values are now much closer to each other. With less separation, the training algorithm will improve more slowly.
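
These numbers are easy to verify with a few lines of plain NumPy (this is just a scratch calculation, not CNTK code):

# scratch check of the double-softmax effect (plain NumPy, not CNTK)
import numpy as np

def softmax(z):
  ez = np.exp(z)
  return ez / np.sum(ez)

raw = np.array([1.5, 3.5, 2.5])
once = softmax(raw)    # [0.0900 0.6652 0.2447] -- nice separation
twice = softmax(once)  # [0.2535 0.4506 0.2959] -- squashed together
print(np.round(once, 4))
print(np.round(twice, 4))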

So, the correct way to implement a CNTK classifier is:

h_layer = C.layers.Dense(hidden_dim,
  activation=C.ops.tanh, name='hidden_layer')(X)  
o_layer = C.layers.Dense(output_dim,
  activation= None, name='out_layer')(h_layer)
nnet = o_layer
model = C.ops.softmax(nnet)
. . .
tr_loss = C.cross_entropy_with_softmax(nnet, Y)
. . .
(train nnet)
. . .
(use model to make predictions)

The idea is to train an un-softmaxed nnet object using cross entropy with softmax, so softmax is applied only once. Then after training, you use the parallel model object which does have softmax.

Very tricky.

A very tricky puzzle box that requires many moves to open.

Posted in CNTK, Machine Learning | Leave a comment

Using a Neural Network Created by CNTK or TensorFlow in a C# Program

Whenever I give a technical talk on neural networks with the CNTK or TensorFlow or Keras libraries, I usually get a question like, “OK, I understand how to create and train a neural network using CNTK/TF/Keras using Python. But my production systems are written in C#. How can I use the CNTK/TF/Keras network in a C# program?”

At the time I’m writing this blog post, there isn’t any sophisticated interop story. The technique I use is to create and train a neural network model using CNTK/TF/Keras with Python, then write the neural network weights and bias values to a text file.

Then I write a C# program that creates a neural network with the same architecture as the one that was created using CNTK/TF/Keras (meaning the same number of input, hidden, and output nodes, and the same activation functions).

In my C# neural network I write a method to read the saved weights and bias values into the neural network. And I write a method to generate output.
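
For example, after training you can pull the weight and bias arrays out of the trained network (each library exposes them a little differently) and dump them as plain text, one value per line. A minimal sketch of the idea, where wts_and_biases is a hypothetical list of NumPy arrays already fetched from the trained CNTK/TF/Keras model:

# sketch: save trained weights and biases to a text file for C# to read;
# wts_and_biases is a hypothetical list of NumPy arrays pulled from the model
import numpy as np

def save_weights(fn, wts_and_biases):
  with open(fn, 'w') as f:
    for arr in wts_and_biases:
      for v in arr.flatten():   # one value per line
        f.write("%0.6f\n" % v)

# save_weights("iris_model_wts.txt", wts_and_biases)

The C# program then just reads the values back in the same order.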

The two screenshots below show what I mean. In the first screenshot I use CNTK to create and train a 4-2-3 NN for the famous Iris Dataset example. After training, I save the 4*2 + 2 + 2*3 + 3 = 19 weights and biases to a text file. The CNTK program feeds (6.4, 3.2, 4.5, 1.5) to the NN and the output is (0.0636, 0.6770, 0.2594) which maps to (0, 1, 0) which means Iris versicolor.

The second screenshot shows a C# program that reads the saved weights into a NN and computes output values for the same set of inputs as the CNTK program. The C# program gets the exact same results — (0.0636, 0.6770, 0.2594).

I fully expect that within a year or two someone will create a C# library that can directly read the binary format used by CNTK to save an entire model (which includes a huge amount of information). Information exchange between systems is a fundamental theme in computer science.

Two lovers exchange a kiss in the rain. Artist unknown.

Posted in CNTK, Machine Learning | 2 Comments

Earth Mover Distance Wasserstein Metric Example Calculation

The Earth Mover Distance (EMD) is the distance between two mathematical distributions. EMD is also called the Wasserstein metric. Here’s an example of how EMD is calculated.

Suppose you have a distribution called “holes” with 13 values where each value is a pair: (1,1), (1,1), (1,1), (1,1), (1,1), (1,1), (1,1), (1,1), (1,1), (1,1), (4,2), (6,1), (6,1). And suppose a second distribution is called “dirts”: (2,1), (2,1), (3,2), (3,2), (3,2), (5,1), (5,1), (5,1), (5,1), (5,1), (5,1), (5,1), (7,1).

What is the distance between the two distributions?

The idea of EMD is to imagine the minimum amount of work required to move the “dirts” into the “holes”. See the graph below. The graph has the data points clumped together and sized according to the number of points in each clump.

The amount of work done is the amount of dirt moved (the “flow”) times the distance moved. For example, the work required to move the 2 units of dirt in d1 a distance of 1 to hole h1 is 2 * 1 = 2. The EMD is the total amount of work divided by the total flow (amount of dirt). Another way to think of EMD is as the effort required to transform one distribution to another.

Finding the minimum amount of work is a very difficult problem, but for this dummy example, it’s pretty clear how to do it:

from   to  'flow'   dist   work
d1     h1    2      1.00    2.00
d2     h1    3      2.24    6.71
d3     h1    5      4.00   20.00
d3     h2    1      1.41    1.41
d3     h3    1      1.00    1.00
d4     h3    1      1.00    1.00
            --             -----
            13             32.12

First you move all 2 dirts in d1 to h1. Then you move all 3 dirts in d2 to h1. Hole h1 needs 5 more to fill so you move 5 of the 7 dirts in d3 to h1. Hole h1 is now filled (and dirt d3 has 2 left). Next you move 1 dirt from d3 to h2, filling h2, leaving just 1 dirt in d3. You move 1 dirt in d3 to h3, and then the 1 dirt in d4 to h3.

The total work done is 32.12 units, and the total flow is 13, so the EMD = 32.12 / 13 = 2.47 units. Interesting!
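
A few lines of Python confirm the arithmetic (finding the minimum-work flow plan is the hard part; here it is just hard-coded from the table above):

# verify the EMD arithmetic for the hard-coded flow plan above
import math

dirts = {'d1':(2,1), 'd2':(3,2), 'd3':(5,1), 'd4':(7,1)}
holes = {'h1':(1,1), 'h2':(4,2), 'h3':(6,1)}
flows = [('d1','h1',2), ('d2','h1',3), ('d3','h1',5),
         ('d3','h2',1), ('d3','h3',1), ('d4','h3',1)]

work = 0.0
total_flow = 0
for (d, h, f) in flows:
  dist = math.sqrt((dirts[d][0] - holes[h][0])**2 +
                   (dirts[d][1] - holes[h][1])**2)
  work += f * dist
  total_flow += f

print("EMD = %0.2f" % (work / total_flow))  # 2.47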

The EMD pops up in machine learning in several scenarios. For example, EMD can be used to give the distance between two images, where the pixels in the images are distributions. EMD is related to many other areas of mathematics too.

“Chester Bridge Street” with the Cross and St. Peter’s Church in the distance. Louise Rayner, 1877.

Posted in Machine Learning | 1 Comment

Is Bitcoin Just a High-Tech Ponzi Scheme?


But I have created a much better Bitbill. Mail me $50 of your obsolete cash and I will send you a $100 Bitbill. You make $50 profit. By Grabthar’s Hammer, what a savings.

Posted in Miscellaneous | 1 Comment

Neural Binary Classification using CNTK

I wrote an article titled “Neural Binary Classification using CNTK” in the March 2018 issue of Microsoft MSDN Magazine.

The goal of a binary classification problem is to make a prediction where the value to predict can take one of just two possible values. For example, you might want to predict if a hospital patient has heart disease or not, based on predictor variables such as age, blood pressure, sex and so on.

There are many techniques that can be used to tackle a binary classification problem. In my article I explained how to use the Microsoft Cognitive Toolkit (CNTK) library to create a neural network binary classification model for the Cleveland Heart Disease dataset (patient has heart disease, or not).

When using libraries other than CNTK, you set up your training data so that the two values to predict are encoded as 0 and 1. For example:

Age  Height  Income  Sex
36   72      48.00   0
25   62      37.50   1

Then you use either mean squared error or binary cross entropy error as the loss function during model training. But weirdly, CNTK doesn’t have a binary cross entropy error. For this reason, and a couple of other CNTK-quirky reasons, when doing binary classification with CNTK, it’s better to set up your data so that you encode one class as (1, 0) and the other as (0, 1). For example:

Age  Height  Income  Sex
36   72      48.00   1 0
25   62      37.50   0 1

In essence, when using CNTK it’s best to treat a binary classification problem as a multi-class classification problem. Unfortunately, the last time I checked the CNTK documentation, there was no explanation of why CNTK doesn’t have a binary cross entropy loss function.
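
Converting existing 0/1 labels to the two-column encoding is a one-liner with NumPy. A small sketch (the data values here are made up):

# sketch: convert a 0/1 label column to the (1, 0) / (0, 1) encoding
# that works best for CNTK two-node binary classification
import numpy as np

labels = np.array([0, 1, 1, 0])   # original 0/1 encoding
two_node = np.eye(2)[labels]      # 0 -> (1, 0), 1 -> (0, 1)
print(two_node)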

“Lobsterboat Maintenance” (ca. 1950), Francis Quirk. Privately held.

Posted in CNTK, Machine Learning

Preparing Text for an LSTM Recurrent Neural Network

One of the really interesting deep learning techniques is text analysis with an LSTM (“long short-term memory”) recurrent neural network. LSTMs can work with sequences of text because, unlike most other kinds of neural networks, LSTMs have memory.

Preparing data for an LSTM is a major challenge. I decided to work with the text from the Sherlock Holmes novel “A Study in Scarlet”. The first few words of the novel, after the title and the table of contents, are, “In the year 1878 I took my degree of Doctor of Medicine of the University of London, and proceeded . . ”

Conceptually, the idea is to create a training dataset that is like:

In the year 1878 | I 
the year 1878 I  | took
year 1878 I took | my
. . .

Each sequence of four words is used to predict the next word. This is “rolling window” data where the size of the window, four in this case, must be determined by trial and error.
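
Building the rolling-window pairs themselves is straightforward. Here’s a sketch, where words is assumed to hold the tokenized text of the novel:

# sketch: build rolling-window (sequence, next-word) training pairs;
# 'words' is assumed to hold the tokenized text of the novel
words = ["In", "the", "year", "1878", "I", "took", "my", "degree"]
win = 4  # window size, found by trial and error

pairs = []
for i in range(len(words) - win):
  pairs.append((words[i:i+win], words[i+win]))

for (seq, nxt) in pairs[:3]:
  print(seq, "|", nxt)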

However, all neural networks, including LSTMs, only understand numeric values. As it turns out, if you use a naive approach and just assign an integer to each word, well, it just doesn’t work. For example, if ‘In’ = 1, ‘the’ = 2, ‘year’ = 3, ‘1878’ = 4, ‘I’ = 5, and so on, the conceptual training data would look like:

1 2 3 4 | 5
2 3 4 5 | 6
3 4 5 6 | 7
. . .

The standard technique of using one-hot encoding doesn’t seem to work very well either. If you had a total of 10 different words, then ‘In’ = (1 0 0 0 0 0 0 0 0 0), ‘the’ = (0 1 0 0 0 0 0 0 0 0), ‘year’ = (0 0 1 0 0 0 0 0 0 0), etc. The problem here is that even a small training corpus would have thousands of distinct words, so each one-hot vector would be huge.

The usual approach is to create a “word embedding” where each word is represented by a vector of two or more values, where the vectors are constructed in an extremely clever way so that similar words have similar vectors.
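
In Keras, for example, an Embedding layer placed in front of an LSTM learns this mapping as part of training. A rough sketch of the kind of model I have in mind (the vocabulary size, embedding dimension, and layer sizes are made-up values):

# rough sketch: embedding + LSTM next-word model in Keras;
# vocab_size, embed_dim, and layer sizes are made-up values
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 5000   # number of distinct words in the corpus
embed_dim = 64      # length of each word vector
win = 4             # rolling window length

model = Sequential()
model.add(Embedding(vocab_size, embed_dim, input_length=win))
model.add(LSTM(100))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')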

I wrote a utility program that created a training file for an LSTM using “A Study in Scarlet”. The process was tricky and took me my entire lunch hour — and I’m quite good at code like this.

It would take several pages to explain the multi-step process, but the screenshot below summarizes what is going on. A very interesting challenge. Next step is to see if I can build an LSTM model that can understand the structure of a Sherlock Holmes novel.

Note: Depending on the type of analysis I’m going to do, I may have to one-hot encode the word-to-predict.

“A Study in Scarlet” was published in an annual magazine (like a paperback novel today) in November 1887, along with two other short stories by other authors. It was the first appearance of Sherlock Holmes.

Posted in Machine Learning