I Give a Talk on Back-Propagation

I recently gave a lecture on back-propagation. I’ve spoken on this topic before and it’s always a challenge because back-propagation has many interrelated ideas.

Even defining back-propagation is quite tricky because you can think of it in many ways. Back-propagation is a technique (or algorithm) to find the values of the weights and biases of a neural network. But this definition only makes sense if the audience fully understands everything about the neural network input-output mechanism.

Back-propagation is based on the Calculus gradient of the neural network error function. Again, for this to make any sense, the audience has to completely understand neural network error — both squared error and cross entropy error — and Calculus gradients, which are a form of Calculus derivatives.

And so on, and so on. There are multiple concepts that all have dependencies on other concepts.

That said, to understand back-propagation, you just have to keep looking at it. I remember when I was first learning back-prop many years ago, it seemed like the learning process would never end. But eventually it did.

I think the real challenge is for a person to determine how deeply they need to understand back-propagation. It’s sort of like learning about an automobile. You need basic knowledge to use a car, deeper knowledge to perform basic maintenance like changing the oil, deeper knowledge yet to work on a transmission, and so on.

In the same way, a machine learning practitioner doesn’t need a profound depth of knowledge to use a neural network library that performs back-propagation. In general, the more you know the better, although I can imagine scenarios where knowing too much could work against you.


Director Alfred Hitchcock made two entirely different movies with the same title, “The Man Who Knew Too Much”, one in 1934 and one in 1956. I prefer the 1956 version.

Posted in Machine Learning

IMDB Movie Review Sentiment Analysis using Keras

The IMDB movie review dataset consists of a total of 50,000 movie reviews from ordinary people. Reviews are simple text and can be positive (7 stars or more) or negative (4 stars or fewer).

The Keras neural network library documentation has a demo program, but the demo “cheats” by importing a built-in version of the movie review data. The documentation code looks like:

import numpy as np
import keras as K
from keras.datasets import imdb  # too easy!
. . .

I set out to do an end-to-end demo that starts from the raw IMDB data because getting data ready for a machine learning model is at least 90% of the effort of almost any realistic problem.

Anyway, after quite a bit of effort, I succeeded. I think.

I worked backwards: I analyzed the structure of the Keras version of the IMDB dataset, then reverse-engineered code to get the raw data into that same format. It wasn’t easy.

In the end, I had training data that looks like:

0 0 0 . . 47 2319 167 . . 1
0 0 0 . . 438 55 5211 . . 0
. . .

Each line is one review. The leading 0s are padding to make each review exactly 500 words long. The remaining integer values are the words of the review, where smaller values correspond to more common words. For technical reasons, the lowest integer value is 4, which is “the”, then 5 is “and”, and so on. The very last value on each line is the label: 1 (positive review) or 0 (negative review). There were 129,888 distinct words in the training reviews.
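
For what it’s worth, the padding step can be done with the Keras pad_sequences() utility. The snippet below is just a rough sketch using two tiny made-up reviews, not my actual preprocessing code:

from keras.preprocessing.sequence import pad_sequences

max_words = 500  # pad/truncate every review to exactly 500 words
# two tiny made-up reviews, already converted to word-index values
reviews = [[47, 2319, 167], [438, 55, 5211]]
padded = pad_sequences(reviews, maxlen=max_words, padding='pre', value=0)
print(padded.shape)  # (2, 500) -- leading 0s, as in the data above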

After getting the data ready, I created a test LSTM model that encodes each word-index into a vector of 32 values (a “word embedding”). The single LSTM cell had a memory of size 100.
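
A minimal Keras definition of that kind of model might look like the sketch below. The 500-word length, the 32-value embedding, and the LSTM memory of 100 are the values mentioned above; the vocabulary size and the other settings are my assumptions, not necessarily what I actually used:

import keras as K

vocab_size = 129888 + 4  # distinct words plus the reserved low index values (an assumption)
model = K.models.Sequential()
model.add(K.layers.Embedding(input_dim=vocab_size, output_dim=32, input_length=500))
model.add(K.layers.LSTM(units=100))                       # memory (state) size 100
model.add(K.layers.Dense(units=1, activation='sigmoid'))  # 1 = positive, 0 = negative
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])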

Anyway, after a lot of work I was able to get all the parts to work together (although there was much pain along the way). My resulting model achieved 86.06% accuracy on the test items (about 23,000 of the 25,000 test reviews, because I limited reviews to 500 words there too). I’d hoped for better accuracy.

There are still some unknown parts of the system, but I’m making good progress understanding the problems.



“Lovers in a Cafe”, Gotthardt Johann Kuehl, c. 1900. Paintings of happy people make me happy.

Posted in Keras, Machine Learning

Introduction to Keras with TensorFlow

I wrote an article titled “Introduction to Keras with TensorFlow” in the May 2018 issue of Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2018/05/01/inroduction-to-keras.aspx.

It’s possible to create neural networks from raw code. But there are many code libraries you can use to speed up the process. These libraries include Microsoft CNTK, Google TensorFlow, Theano, PyTorch, scikit-learn and Caffe. Most neural network libraries are written in C++ for performance but have a Python API for convenience.

In my article I demonstrated how to get started with the popular Keras library. Keras is a bit unusual because it’s a high-level wrapper over TensorFlow. The idea is that TensorFlow works at a relatively low level and coding directly with TensorFlow is very challenging. Put another way, you write Keras code using Python. The Keras code calls into the TensorFlow library, which does all the work.

I did the standard Iris dataset example where the goal is to predict species (“setosa” or “versicolor” or “virginica”) from four predictors: petal length and width, and sepal length and width (a sepal is a leaf-like structure).
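
For reference, a bare-bones Keras network for the Iris problem looks something like this sketch (the hidden layer size, activations, and optimizer here are my choices for illustration, not necessarily the ones used in the article):

import keras as K

model = K.models.Sequential()
model.add(K.layers.Dense(units=5, input_dim=4, activation='tanh'))  # 4 predictor values
model.add(K.layers.Dense(units=3, activation='softmax'))            # 3 species
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['acc'])
# model.fit(train_x, train_y, batch_size=1, epochs=100)  # training data not shown here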

I like Keras a lot, but it does have disadvantages. For me, Keras is easy to use but is relatively hard to customize — a classic code library tradeoff.

I slightly prefer the Microsoft CNTK library to Keras, mostly for technical reasons. But the use of Keras seems to be increasing much faster than the use of CNTK, although I have no solid numbers (I base my opinion on subjective things like the number of posts on Stack Overflow, so I could well be wrong).

If you’re a software developer, you might want to consider taking Keras out for a test drive.



The word “keras” means “horn” in Greek. Trust me, you will get some strange results if you do an Internet image search for “horn”. This horn hairstyle is striking but doesn’t look practical for daily use.

Posted in Keras, Machine Learning

Avoiding an Exception when Calculating Softmax

The softmax of a set of values returns a set of values that sum to 1.0, so they can be interpreted as probabilities. The softmax function is one of the fundamental tools for machine learning. Suppose you have some neural network classifier to predict whether a person is a Democrat, a Republican, or other. The raw output values of the network could be something like (3.0, 5.0, 2.0), but the result of softmax would be (0.1142, 0.8438, 0.0420), which would mean P(Republican) = 0.8438, and so Republican is the prediction because it has the highest probability.

Mathematically, to compute the softmax of a set of values, you compute exp() of each value and sum those exp() results. The exp(x) function is Euler’s number (not Euler’s constant) raised to x. Then the softmax of each value x is exp(x) / sum.

The calculation can blow up because the exp(x) function can return astronomically large values for even moderate-sized values of x. One way to avoid an exception is to use the “max trick”. Because exp(x - c) = exp(x) / exp(c) for any constant c, subtracting a constant from every x doesn’t change the softmax result. So you can find the max of the x values, subtract the max from each x, compute and sum the exp(x-max) values, and then the softmax of each x is, as before, exp(x-max) / sum.

For example, for (3.0, 5.0, 2.0), the max is 5.0 and subtracting gives (-2.0, 0.0, -3.0) then the exp() of each is (0.1353, 1.0000, 0.0498). The sum of those values is 1.1851 and 0.1353 / 1.1851 = 0.1142, 1.0000 / 1.1851 = 0.8438, 0.0498 / 1.1851 = 0.0420. Notice that all the exp(x) calculations occur on small or negative values.
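
In code, the two approaches look something like the sketch below. This is just a minimal NumPy illustration of the math above, not production code, and the function names are mine:

import numpy as np

def softmax_naive(x):
  exps = np.exp(x)              # can overflow for large x values
  return exps / np.sum(exps)

def softmax_max_trick(x):
  exps = np.exp(x - np.max(x))  # all arguments are <= 0, so no overflow
  return exps / np.sum(exps)

x = np.array([3.0, 5.0, 2.0])
print(softmax_naive(x))      # [0.1142 0.8438 0.0420] approx.
print(softmax_max_trick(x))  # same values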

A variation of the max trick is to avoid the division operation. To do this you compute the ln() of the sum of the exp(x-max) values and then, instead of dividing, you subtract and take exp(x - max - ln(sum)).

For example, the sum of the exp(x-max) values is 1.1851, and ln(1.1851) = 0.1698. Then exp(-2.0 - 0.1698) = 0.1142, exp(0.0 - 0.1698) = 0.8438, and exp(-3.0 - 0.1698) = 0.0420.
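
And a rough sketch of the division-free variation, again just to illustrate the arithmetic:

import numpy as np

def softmax_log_sum(x):
  m = np.max(x)
  ln_sum = np.log(np.sum(np.exp(x - m)))  # ln of the sum, 0.1698 for the example
  return np.exp(x - m - ln_sum)           # subtract instead of divide

print(softmax_log_sum(np.array([3.0, 5.0, 2.0])))  # [0.1142 0.8438 0.0420] approx.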

A few years ago, when neural networks had to be implemented from scratch, you’d have to know details like this. In the last two years or so, with the creation of neural network libraries such as TensorFlow and CNTK, all these details are handled for you. But it’s still good to know what goes on behind the scenes.



Some examples of newspaper headlines that should have thrown an exception.

Posted in Machine Learning

Betweenness Centrality

Network graphs are interesting data structures. You can compute all kinds of metrics on a graph, including several measures of centrality. Centrality metrics indicate how important a node is in some way. Different centrality measures give you different information.

Betweenness centrality is a measure of how “between” other nodes a particular node is. If a node has a high betweenness centrality value, it lies on many of the shortest paths between other nodes, so relatively more information passes through it.

Betweenness centrality (BC) is a bit tricky to compute. To compute the BC for node 1 in the graph below, you examine all pairs of nodes that don’t include node 1. For each pair, you find the total number of shortest paths between the pair and the number of those shortest paths that pass through node 1. The BC is the sum of the resulting proportions.

So for node 1, BC(1) = 2.5, computed as follows:

(2,3): 1 / 1 = 1.0
(2,4): 0 / 1 = 0.0
(2,5): 1 / 2 = 0.5
(2,6): 0 / 1 = 0.0
(3,4): 1 / 1 = 1.0
(3,5): 0 / 1 = 0.0
(3,6): 0 / 1 = 0.0
(4,5): 0 / 1 = 0.0
(4,6): 0 / 1 = 0.0
(5,6): 0 / 1 = 0.0
               ===
               2.5

There is only one shortest path for pair (2,3), and it is 2-1-3. This path contains node 1. There are two shortest paths for pair (2,5): 2-1-3-5 and 2-4-6-5. Of these, one path contains node 1. And so on.

The BCs for the other nodes would be computed in the same way.
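
As a sanity check, the NetworkX library can compute betweenness centrality directly. The edge list in this sketch is my reconstruction from the worked example above, so treat it as an assumption:

import networkx as nx

# edges reconstructed from the shortest-path example above (an assumption)
G = nx.Graph([(1,2), (1,3), (1,4), (2,4), (3,5), (4,6), (5,6)])
bc = nx.betweenness_centrality(G, normalized=False)
print(bc[1])  # 2.5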

I enjoy working with graphs, but for some reason, I don’t love working with graphs.



“The Judgement of Paris”, Georges Barbier. Circa 1925.

Posted in Miscellaneous

My Top Ten Favorite Movies Set on Trains

I like movies that take place mostly on a train. The limited amount of room available forces directors, writers, and actors to be clever and creative. Here are my top 10 favorites.


1. Murder on the Orient Express (1974) – In the 1930s, Belgian detective Hercule Poirot is traveling from Turkey to Paris when one of the passengers is murdered. Unlike many mysteries, this case has too many clues pointing to too many suspects. Great plot (based on the Agatha Christie novel), great acting. My grade is a solid A+.


2. Terror by Night (1946) – On a train from London to Scotland, a man is murdered and the Star of Rhodesia diamond is stolen. Sherlock Holmes and Dr. Watson are on board because they had been hired to protect the diamond. Will they discover which one of the several suspects is guilty, and recover their reputations? Grade = B+.


3. Source Code (2011) – Colter Stevens (played by Jake Gyllenhaal) wakes up on a train headed to Chicago, but he doesn’t know who he is or have any memory. As it turns out, he has been sent from the future to stop a terrorist bomb plot, but he only has eight minutes. Clever plot, good acting, unexpected happy ending. Grade = B+.


4. Horror Express (1972) – Christopher Lee and Peter Cushing play scientists on a train from China to Moscow in 1906. Grisly murders occur. Who, or what, is behind it all? An interesting science fiction movie that isn’t very well known. Grade = B.


5. Under Siege 2: Dark Territory (1995) – Casey Ryback (played by the so-bad-he-is-awesome Steven Seagal) is on a train from Denver to Los Angeles. Armed mercenaries hijack the train. Bad idea when Seagal is on board! It’s easy to make fun of Seagal movies but this one is quite good if you like lots of action. Grade = B.


6. The Commuter (2018) – Michael MacCauley (played by Liam Neeson) is a former cop now in a mundane job, traveling to and from work on the same commuter train every day. On one trip, a mysterious woman offers MacCauley a lot of money to find a person on the train who is travelling incognito, and put a GPS device on him/her. Will MacCauley figure out the plot? (Hint: Of course he will!) Grade = B.


7. The Lady Vanishes (1938) – Iris Henderson, a young tourist traveling on a train in Europe, makes acquaintance with an older woman, Miss Froy. The older woman disappears but nobody except Iris remembers seeing her! With handsome fellow traveler Gilbert Redman, Iris unravels the mystery. Directed by Alfred Hitchcock. Grade = B.


8. Bombay Mail (1934) – British Inspector Dyke is on a train travelling from Calcutta to Bombay in the 1930s. A British governor on the train is poisoned by cyanide. Classic murder mystery with lots of clues and suspects. Grade = C+.


9. Terror Train (1980) – A group of college students hold a New Year’s Eve costume party on a train. Having everyone wear costumes when one of them is a psychotic murderer is a bad idea. An early starring role for actress Jamie Lee Curtis. Standard horror/slasher fare but not all that bad. Grade = C.


10. Snowpiercer (2013) – Bizarre science fiction movie. The last remnants of humanity live on an atomic-powered train that travels endlessly through a frozen wasteland. This movie got excellent reviews, but my grade is just a solid C.



A few other movies set on trains:

Transsiberian (2008) – An American couple’s journey from China to Russia becomes a nightmare after they befriend a pair of fellow travelers.

The Taking of Pelham 1 2 3 (1974) – A subway train is hijacked for ransom. Most reviewers give this film great reviews but it has never resonated with me.

Last Passenger (2013) – A train travelling out of London is under the control of a suicidal engineer. Very much an action film. Well-received by critics but it never completely grabbed me.

Snakes on a Train (2006) – Snakes + Mayan curse + train = all you need to know. I actually liked this movie better than the big-budget “Snakes on a Plane”.

Breakheart Pass (1975) – Set in the Old West, a train is carrying a marshal and an outlaw, but not everything is what it appears to be.

Murder on the Orient Express (2017) – I had high hopes for this remake, but it was, in my opinion, inferior in every way to the 1974 version.

Von Ryan’s Express (1965) – Frank Sinatra plays a WWII prisoner of war who, along with others, is being transported from Italy to Germany.

Runaway Train (1985) – The driver of a speeding train in the Alaska wilderness has a heart attack. With Rebecca De Mornay.

Unstoppable (2010) – Chris Pine and Denzel Washington must stop a runaway freight train.

Posted in Top Ten

A Quick Look at the Embedded Learning Library (ELL)

The Embedded Learning Library (ELL) is an open source project. The goal is to create a cross compiler for machine learning models. Briefly, machine learning models, such as those for image recognition, are typically very large in terms of memory. This is OK if you’re going to run the ML model on a powerful PC, but not OK if you want to run the model on a small IoT device.

The ELL library has tools to convert a standard ML model, such as one created by CNTK or Darknet, into a universal .ell file (in JSON format). Then ELL tools can compile the .ell file into a model for a target device, such as a Raspberry Pi or an Android phone.

The first step was to install the ELL system on my machine. The ELL library is still under development, so installing ELL means compiling C++ source code. If you’ve ever done this, you know you can expect problems. And I ran into plenty, but fortunately for me, the ELL team sits about 50 feet away from me, so I was able to get ELL built with their expert help.

Next, I created a CNTK model for the Iris Dataset. A CNTK model, like most models, is saved in a binary format. I used the ELL cntk_import.py tool to create a .ell format model, and then I used the ELL wrap.py tool to convert the .ell file into a Python package suitable for deployment to a PC. In a realistic scenario, I’d convert the .ell file into a package suitable for deployment on a Raspberry Pi or similar IoT device.

My last step was to write a short Python script to test the ELL result package. I did so, and got the same results as the original CNTK model. Nice!

The entire process took several hours even though the ELL documentation at https://github.com/Microsoft/ELL/blob/master/INSTALL-Windows.md is very good. Because there are thousands of files involved, and many tools are used (the Visual Studio C++ compiler, a Python engine, the CMake tool, the SWIG library, and so on), I ran into quite a few problems with things like version dependencies and my PATH environment variable.

The Microsoft ELL library is extremely ambitious. The ELL team is quite small, so they’re going to have to prioritize their efforts. But if successful, ELL will have a big impact on developers who want to deploy a machine learning model on an IoT device.



An ambitious canine, shooting for the stars

Posted in CNTK, Machine Learning