I gave the keynote talk for the event. My keynote was titled “The Present and Future of Machine Learning and Artificial Intelligence”. For “the present”, I described what deep neural networks are, LSTM networks, CNN networks, and so on. For “the future”, I talked about GANs, homomorphic encryption, quantum computing, etc.

I think the one slide I got most excited about was the one where I described the AlphaZero deep RL chess program, and its stunning victory, with 28 wins, 72 draws, and no losses, over the reigning world champion program, Stockfish. This amazing achievement shows the incredible potential of ML.

Most of the attendees I talked to were data scientists or business analysts at medium- and large-size organizations, such as banks, insurance companies, energy companies, and state and federal government. But there were many attendees from small companies, and from all kinds of backgrounds too.

Many big tech companies were represented at the 2019 TDWI event including Google, IBM, Oracle, SAP, SAS, and others. The event Expo was nice even though it was relatively small. There were about 40 companies there. I especially enjoyed talking to the representatives from a Seattle-based company named Algorithmia.

All things considered, the 2019 TDWI Conference was a very good use of my time. I learned a lot, both technically and from a business perspective, and I’m confident I was able to educate attendees about Microsoft machine learning technologies. And I returned to my work with renewed enthusiasm and new ideas.

*The 2019 TDWI Expo was relatively small but had a lot of interesting companies. I enjoyed talking to the representatives from the companies because I gained useful insights into the business side of ML/AI that will help me do my job better.*

On a recent trip, for some reason, the topic of Naive Bayes classification popped into my head. I spent the next couple of hours mentally coding up a minimal implementation, with a little bit of help from an airline paper napkin and a pen.

When I got back home, I opened up my laptop and coded my mental implementation using Python. There were several glitches of course, but I had all the main ideas correct.

For my demo, I created a 40-item dummy data file that looks like:

'A', 'R', 'T', '1'
'C', 'R', 'S', '0'
'Z', 'L', 'M', '0'
'Z', 'R', 'M', '1'
. . .

There are three predictor values followed by a 0 or a 1. The goal is to classify data as 0 or 1. The first variable can be one of (A, C, E, Z). The second variable can be one of (L, R). The third variable can be one of (S, M, T). There are versions of Naive Bayes that work with numeric predictor data, but the simplest form works with categorical predictor values. My demo is binary classification, but Naive Bayes easily extends to multiclass classification.

An implementation of Naive Bayes is relatively short but quite deep. A full explanation would take several pages. But briefly, joint counts (such as the count of items with both 'E' and '0') and counts of the dependent values (0 and 1) are computed, then combined according to Bayes' Law to yield probabilities.

One important implementation factor is the tradeoff between a specific implementation for a given problem, versus a completely general implementation that can be applied to most problems. I opted, as I usually do, for a mostly specific, non-general implementation.

In my demo, I classified inputs (E, R, T). The result is a pair of values (0.3855, 0.6145) which loosely represent the probabilities that the input is class 0 and class 1, respectively. Because the second value is larger, the prediction is class 1.
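My actual demo was coded in Python, but the count-and-combine idea can be sketched briefly. Here's a minimal JavaScript version (a sketch, not my demo code), using a small hypothetical eight-item dataset in the same format, with add-one (Laplacian) smoothing so that a zero joint count doesn't wipe out an entire product:

```javascript
// Minimal naive Bayes for categorical predictors -- a sketch, not the
// demo program. The eight data items below are hypothetical.
// Format: [predictor0, predictor1, predictor2, class]
let data = [
  ['A','R','T',1], ['C','R','S',0], ['Z','L','M',0], ['Z','R','M',1],
  ['E','L','T',1], ['A','L','S',0], ['E','R','M',1], ['C','L','T',0]
];

function naiveBayes(data, x) {
  let nPred = x.length;  // 3 predictor variables
  let classCounts = [0, 0];
  for (let row of data) classCounts[row[nPred]] += 1;

  let evidence = [0.0, 0.0];
  for (let c = 0; c < 2; ++c) {
    let e = classCounts[c] / data.length;  // P(class = c)
    for (let j = 0; j < nPred; ++j) {
      let joint = 0;  // count items with x[j] in column j AND class c
      for (let row of data)
        if (row[j] === x[j] && row[nPred] === c) joint += 1;
      e *= (joint + 1) / (classCounts[c] + nPred);  // smoothed P(x_j | c)
    }
    evidence[c] = e;
  }
  let sum = evidence[0] + evidence[1];
  return [evidence[0] / sum, evidence[1] / sum];  // pseudo-probabilities
}

let probs = naiveBayes(data, ['E','R','T']);
console.log(probs[0].toFixed(4) + "  " + probs[1].toFixed(4));
```

The returned pair sums to 1.0, and the larger value determines the predicted class, just as in the (0.3855, 0.6145) example above.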

*I have always been fascinated by model train buildings. Some are truly works of art. Most of the people I’m comfortable hanging out with enjoy modeling reality in some way (machine learning, art, games, etc.)*

For my demo, I created a small text file that has five lines of data from the well-known Iris Dataset:

5.5, 2.5, 4.0, 1.3, 0, 1, 0
6.3, 3.3, 6.0, 2.5, 0, 0, 1
5.8, 2.7, 5.1, 1.9, 0, 0, 1
7.1, 3.0, 5.9, 2.1, 0, 0, 1
6.3, 2.9, 5.6, 1.8, 0, 0, 1

Before starting, I coded up functions to create and print a two-dimensional numeric matrix.

There are many ways to read data from a file using Node.js. My approach reads the entire file contents into memory, splits on "\n" into an array of strings, then parses out each line:

let fs = require('fs');

let all = fs.readFileSync('iris_five.txt', "utf8");
all = all.trim();  // final crlf in file
let lines = all.split("\n");
let n = lines.length;
let m = matrixMake(n, 7, 0.0);  // numeric
for (let i = 0; i < n; ++i) {  // each line
  let tokens = lines[i].split(",");
  for (let j = 0; j < 7; ++j) {  // each val curr line
    m[i][j] = parseFloat(tokens[j]);
  }
}
matrixPrint(m, 1);  // 1 decimal

This approach isn’t very robust, and won’t work for huge files, but it’s simple and effective for basic neural network purposes. Here’s a version of the code that’s been refactored into a function that resembles the Python NumPy loadtxt() function:

let fs = require('fs');

function loadTxt(fn, delimit, usecols)
{
  let all = fs.readFileSync(fn, "utf8");  // giant string
  all = all.trim();  // strip final crlf in file
  let lines = all.split("\n");
  let rows = lines.length;
  let cols = usecols.length;
  let result = matrixMake(rows, cols, 0.0);
  for (let i = 0; i < rows; ++i) {  // each line
    let tokens = lines[i].split(delimit);
    for (let j = 0; j < cols; ++j) {
      result[i][j] = parseFloat(tokens[usecols[j]]);
    }
  }
  return result;
}

let data_x = loadTxt(".\\iris_train.txt", ",", [0,1,2,3]);
let data_y = loadTxt(".\\iris_train.txt", ",", [4,5,6]);

*From the movie “The Matrix” (1999) – the white rabbit girl, the mysterious woman in the red dress, Trinity, Persephone.*

I normally wouldn’t give a talk on a topic where I don’t fully understand all the details. But, I’m working with a team in my large tech company, and if my autoencoder reconstruction idea is valid, the technique will be extremely valuable to them.

As always, when I presented the details, the attendees asked great questions which forced me to think very deeply. (The people at my company are, for the most part, very smart.) This details-matter dynamic is characteristic of the machine learning research I'm doing.

Here’s one of at least a dozen examples (which will only make sense if you understand neural autoencoders). The dataset had 784 input values — the MNIST image dataset where each value is a pixel value between 0 and 255, normalized to between 0.0 and 1.0. My demo autoencoder had a 784-100-50-100-784 architecture. The hidden layers used tanh activation, and I applied tanh activation to the output layer too.

But the question is, why not sigmoid activation, or ReLU activation, or even no/identity activation on the output layer? The logic is that because the input values are between 0.0 and 1.0, and an autoencoder predicts its inputs, you surely want the output values to be confined to 0.0 to 1.0 which can be accomplished using sigmoid activation. Why did I use tanh output activation?

Well, the answer is long, so I won't try to give it here. My real point is that this was just one of many details about the autoencoder reconstruction error technique for anomaly detection. And on top of all the conceptual ideas, I used the PyTorch neural network library so there were many language and engineering issues to consider too.

Anyway, I thought I did a good job on my talk, and I got as much value from delivering it as the attendees who listened to it.

*Artist Mort Kunstler (b. 1931) created many memorable paintings that were used for the covers of men’s adventure magazines in the 1960s. I’m not really sure if the works are supposed to be satire or not. Kunstler’s paintings have an extremely high level of detail.*

Suppose k is set to 5. The first part of k-NN calculates the distances from the item-to-be-classified to all the labeled data, then finds the k = 5 nearest/closest labeled data vectors.

For example, suppose the classification problem has 3 classes, and the 5 nearest data vectors are:

  data                        dist    class
  ===========================================
  (0.45, 0.32, 0.78, 0.11)   0.1234     1
  (0.65, 0.33, 0.92, 0.54)   0.2076     0
  (0.52, 0.47, 0.82, 0.55)   0.3588     0
  (0.41, 0.78, 0.43, 0.58)   0.5032     1
  (0.73, 0.29, 0.44, 0.61)   0.7505     2

To which class, 0, 1, or 2, should the item-to-be-classified be assigned? Class 1 is the closest, but class 0 is the second and third closest.

The simplest approach is to use uniform voting weights. For k = 5 each weight is 1 / 5 = 0.20 so

Class 0: 2 * 0.20 = 0.40
Class 1: 2 * 0.20 = 0.40
Class 2: 1 * 0.20 = 0.20

This results in a tie between class 0 and class 1. The uniform voting weights approach is equivalent to a simple majority vote.

There are several ways to create weights that give more importance to closer data values. The inverse weights approach computes the inverse of each distance, sums the inverses, then divides each inverse by the sum of the inverses. For the data above the weights are:

   dist    inverse    wts = inv/sum
  ==================================
  0.1234    8.1037       0.4259
  0.2076    4.8170       0.2532
  0.3588    2.7871       0.1465
  0.5032    1.9873       0.1044
  0.7505    1.3324       0.0700
          ---------     --------
           19.0275       1.0000

The voting is:

Class 0: 0.2532 + 0.1465 = 0.3996
Class 1: 0.4259 + 0.1044 = 0.5303
Class 2: 0.0700           = 0.0700

And so the conclusion is that the item-to-be-classified is class 1, which breaks the tie produced by the uniform weights approach.

I usually use the inverse weights approach but there are other weighting techniques for k-NN. One alternative is to use the ranks of the distances and compute rank order centroids. This weights closer labeled data more heavily than the inverse technique. Another approach is to sum the distances, take each distance divided by the sum, then reverse the order of the weights. This penalizes big distances more heavily than the inverse technique.
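The inverse weights computation can be sketched in a few lines of JavaScript. This sketch assumes the distance calculation has already been done, and uses the five example neighbors from the table above:

```javascript
// Inverse-distance weighted voting for k-NN, using the five example
// neighbors from above (distance to the item, and labeled class).
let neighbors = [
  { dist: 0.1234, cls: 1 },
  { dist: 0.2076, cls: 0 },
  { dist: 0.3588, cls: 0 },
  { dist: 0.5032, cls: 1 },
  { dist: 0.7505, cls: 2 }
];

function inverseWeightVote(neighbors, nClasses) {
  let inverses = neighbors.map(nb => 1.0 / nb.dist);
  let sum = inverses.reduce((a, b) => a + b, 0.0);  // approx 19.0275 here
  let votes = new Array(nClasses).fill(0.0);
  for (let i = 0; i < neighbors.length; ++i)
    votes[neighbors[i].cls] += inverses[i] / sum;  // normalized weight
  return votes;  // weighted vote total per class; largest wins
}

let votes = inverseWeightVote(neighbors, 3);
console.log(votes.map(v => v.toFixed(4)).join("  "));
```

The three vote totals sum to 1.0, and the class with the largest total is the predicted class.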

*I find high fashion quite interesting. I couldn’t even begin to assign quality weights to the dresses at a fashion show.*

But what if your data has some non-numeric values? For example, imagine some fake flower data:

blue  5.1  3.5  1.4  0.2
pink  4.9  3.0  1.4  0.4
teal  4.7  3.2  1.3  0.3
etc.

The first value is the color of the flower then the next four values are sepal length, sepal width, petal length, petal width. How do you deal with the color variable if you want to cluster the dataset? If you’re not familiar with clustering you’d think this would be easy, but trust me, it’s not.

My idea is to convert mixed data into strictly numeric data by using a neural autoencoder. I ran a little experiment where I first encoded the color as blue = (1, 0), pink = (0, 1), teal = (-1, -1) then I created a 6-10-8-10-6 autoencoder that accepts the six input values and predicts those same values. After training, the 8 nodes in the central hidden layer are a strictly numeric representation of each flower. . .

At least in theory. My scheme is somewhat related to and based on word embeddings where words are converted to numeric vectors in a roughly similar way.

Anyway, there’s a lot going on here and I’ve found almost zero existing research or practical information along these lines. One of the problems when working with clustering is that a good clustering is very hard to define precisely. But I’ll keep probing away at this problem.
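To make the encoding idea concrete, here's a tiny JavaScript sketch of just the convert-to-numeric input step. The encodeFlower() helper is hypothetical, and in practice the four numeric measurements would be normalized before training:

```javascript
// Sketch of the mixed-to-numeric encoding described above. The color
// uses the (1,0) / (0,1) / (-1,-1) scheme; the four measurements are
// passed through as-is. encodeFlower() is a hypothetical helper.
let colorCodes = { "blue": [1, 0], "pink": [0, 1], "teal": [-1, -1] };

function encodeFlower(color, sepLen, sepWid, petLen, petWid) {
  return colorCodes[color].concat([sepLen, sepWid, petLen, petWid]);
}

let x = encodeFlower("pink", 4.9, 3.0, 1.4, 0.4);  // six input values
console.log(x);  // [ 0, 1, 4.9, 3, 1.4, 0.4 ]
```

The six values would then feed the 6-10-8-10-6 autoencoder, and after training, the central hidden layer's eight values become the strictly numeric representation of the flower.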

*“Good” art is impossible to define precisely. Five examples from artists whose work is often described as kitsch/bad. Beauty is in the eye of the beholder but I think all five paintings are wonderful. “The Green Lady”, Vladimir Tretchikoff. “Lamplight Manor”, Thomas Kinkade. “Tina”, JH Lynch. “Dancers in the Koutoubia Palace”, LeRoy Neiman. “Gypsy Girl”, Charles Roka.*

Infer.NET is a code library created by Microsoft Research. The library contains functions that can be used to solve problems that involve probability. In my article, I demonstrate how to use Infer.NET to infer/calculate the ratings of a set of sports teams based on some win-lose data.

Specifically, I set up six teams: Angels, Bruins, Comets, Demons, Eagles, Flyers. Then I set up nine game results:

Angels beat Bruins
Bruins beat Comets
Comets beat Demons
Angels beat Eagles
Bruins beat Demons
Demons beat Flyers
Angels beat Flyers
Comets beat Eagles
Eagles beat Flyers

Then I supplied some math assumptions: the ratings of the teams are Normal (Gaussian) distributed, and the average rating of a team is 2000 with a standard deviation of 200. These assumptions were somewhat arbitrary. Then the Infer.NET program used its built-in functions to find the set of ratings that best match the data. The results were: Angels = 2256.8, Bruins = 2135.3, Comets = 2045.5, Demons = 1908.0, Eagles = 1914.6, Flyers = 1739.7.

This is an example of what's called maximum likelihood estimation. Working backwards, starting with these ratings, the win-lose data is more likely than if the ratings were different.
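Infer.NET programs are written in C#, so I won't reproduce one here. But the find-ratings-that-make-the-data-likely idea can be sketched with a much simpler model: a Bradley-Terry style setup where the probability that team i beats team j is sigmoid(rating i minus rating j), fitted by gradient ascent. To be clear, this is not what Infer.NET does, and the fitted numbers won't match the results above, but the Angels should come out strongest and the Flyers weakest:

```javascript
// Likelihood maximization sketch for team ratings. This uses a simple
// Bradley-Terry style model, NOT the Infer.NET Gaussian model, so the
// fitted values won't match the Infer.NET results in the text.
let teams = ["Angels","Bruins","Comets","Demons","Eagles","Flyers"];
let games = [  // [winner index, loser index] -- the nine results above
  [0,1],[1,2],[2,3],[0,4],[1,3],[3,5],[0,5],[2,4],[4,5]
];

function sigmoid(z) { return 1.0 / (1.0 + Math.exp(-z)); }

function fitRatings(games, nTeams, iters, lr, reg) {
  let r = new Array(nTeams).fill(0.0);  // all teams start equal
  for (let it = 0; it < iters; ++it) {
    let grad = r.map(x => -reg * x);  // weak pull toward the average
    for (let [w, l] of games) {
      let p = sigmoid(r[w] - r[l]);  // predicted P(winner wins)
      grad[w] += 1.0 - p;  // push winner's rating up
      grad[l] -= 1.0 - p;  // push loser's rating down
    }
    for (let i = 0; i < nTeams; ++i) r[i] += lr * grad[i];
  }
  return r;
}

let r = fitRatings(games, 6, 2000, 0.05, 0.1);
for (let i = 0; i < 6; ++i)
  console.log(teams[i] + ": " + r[i].toFixed(3));
```

The learning rate, iteration count, and regularization value here are arbitrary choices for the sketch.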

The Infer.NET code library, which was developed by researchers, feels different from a code library that has been designed by software engineers. This creates some friction for most developers who want to use Infer.NET. My point here is that the functionality of a code library is important, but the interface is important too.

There are other probabilistic programming code libraries and languages, including Stan, Pyro, and Edward, which are the most common. The Wikipedia entry on Probabilistic Programming Languages (PPLs) lists 46 libraries. None of the machine learning guys I work with use probabilistic programming which leads me to believe that PPLs are used mostly by researchers.

*College majors. Electrical engineering 88% male. Computer Science 82% male. Physical sciences 62% male. Health professions 85% female. Education 80% female. English 80% female. In the workplace, roughly 75% of computer science employees are male and 25% are female. Therefore, there’s a strong statistical argument that women are over, not under, represented in computer science jobs.*

The standard shuffle algorithm is called the Fisher-Yates shuffle (or less frequently, Knuth shuffle). I coded up a demo. The demo relies on a program-defined class that can generate reproducible pseudo-random numbers.

The calling code looks like:

let v = vecRange(8);  // creates [0, 1, 2, .. 7]
shuffle(v, 1);  // 1 is a seed value for randomness
// values in v now in different order

The whole idea here is that when training a neural network using back-propagation (also called stochastic gradient descent), it’s critically important that you visit the training items in a different, random order, on each pass (“epoch”).
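For completeness, here's a minimal sketch of what the two helper functions could look like. The Rnd class is a hypothetical stand-in for my program-defined reproducible generator (a basic Lehmer generator):

```javascript
// Minimal reproducible random number generator (basic Lehmer
// generator) plus a Fisher-Yates shuffle that uses it. The Rnd class
// is a hypothetical stand-in for the program-defined class mentioned.
class Rnd {
  constructor(seed) { this.seed = seed; }  // use seed 1 to 2147483646
  next() {  // uniform pseudo-random value in [0, 1)
    this.seed = (this.seed * 16807) % 2147483647;
    return this.seed / 2147483647;
  }
  nextInt(lo, hi) {  // pseudo-random integer in [lo, hi)
    return Math.trunc((hi - lo) * this.next() + lo);
  }
}

function vecRange(n) {  // creates [0, 1, 2, .. n-1]
  let result = [];
  for (let i = 0; i < n; ++i) result[i] = i;
  return result;
}

function shuffle(v, seed) {  // Fisher-Yates, in place
  let rnd = new Rnd(seed);
  for (let i = 0; i < v.length - 1; ++i) {
    let r = rnd.nextInt(i, v.length);  // r in [i, n)
    let tmp = v[r]; v[r] = v[i]; v[i] = tmp;  // swap
  }
}

let v = vecRange(8);
shuffle(v, 1);  // same seed always produces the same order
console.log(v);
```

Because the generator is seeded, training runs are reproducible, which is very useful when debugging a neural network.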

OK, one more piece of the neural networks with JavaScript puzzle has fallen into place.

*One of the very first jigsaw puzzles made was a map of Europe (1766). I did a Disney villainesses puzzle over Christmas – it was surprisingly difficult. Clever face makeup for Halloween. I like jigsaw puzzles that feature historical topics, such as this one of the famous Confederate general Robert E. Lee from the American Civil War.*

I was very surprised at the number of weak examples I found on the Internet. Too much chit-chat, and not enough example code.

To cut to the chase, here’s my preferred way to do it. First, I created a file named utilities.js like so:

// utilities.js

function matrixMake(rows, cols, val)
{
  let result = [];  // avoid new Array()
  for (let i = 0; i < rows; ++i) {
    result[i] = [];
    for (let j = 0; j < cols; ++j) {
      result[i][j] = val;
    }
  }
  return result;
}

function matrixPrint(m, dec)
{
  let rows = m.length;
  let cols = m[0].length;
  for (let i = 0; i < rows; ++i) {
    for (let j = 0; j < cols; ++j) {
      process.stdout.write(m[i][j].toFixed(dec));
      process.stdout.write(" ");
    }
    console.log("");
  }
}

// --------------------------------------------------

module.exports = { matrixMake, matrixPrint };

There are two normal JavaScript function definitions followed by a Node.js module.exports block at the bottom.

I named the file that calls the external functions test_utilities.js and coded it as:

// test_utilities.js

console.log("Node.js exporting demo \n");

let U = require("./utilities.js");

let m = U.matrixMake(3, 4, 0.0);
U.matrixPrint(m, 2);

Easy peasy. Use the require() function to load the external functions. The only gotcha was that even though I was running on Windows, I had to use the Linux path syntax with a forward slash instead of the Windows backslash.

The moral of the story is that if you want to explain a programming model to a software engineer, just show him some example code.

*“Model Airplane News” magazine has been published continuously since 1929. From left: March 1932, July 1945, January 1958, August 1965, June 1975, October 1982, October 2018.*

The Levenshtein distance between two strings is the minimum number of operations necessary to convert one string into the other, where an operation is an insertion, deletion, or substitution of a single character. For example, if:

A = cat
B = cots

then the Levenshtein distance is 2. Starting with “cat”, you must change the ‘a’ to ‘o’, and then add an ‘s’. There is a very cool visual algorithm to compute the Levenshtein distance. I’ll walk you through it. First construct this matrix:

       c  a  t
   0   1  2  3
c  1
o  2
t  3
s  4

The short string goes on top, the longer string goes vertically, and each letter gets a 1-based index value.

Now, working by columns, from top to bottom, for each cell in the matrix, first assign a 0 if the corresponding characters match, and a 1 if the characters do not match. Then adjust this temp value by assigning the minimum of these three values:

1. the value in the cell above plus 1.

2. the value in the cell to the left plus 1.

3. the value in the cell to the upper left plus the temp value.

So the first cell in the example above gets a temp value of 0 (because the ‘c’ characters match) then gets modified to 0 (because condition #3 holds). The first column ends up like this:

       c  a  t
   0   1  2  3
c  1   0
o  2   1
t  3   2
s  4   3

The first cell in the second column gets a temp value of 1 because ‘a’ does not equal ‘c’. The temp value gets modified to 1 because condition #2 holds. The second column ends up like this:

       c  a  t
   0   1  2  3
c  1   0  1
o  2   1  1
t  3   2  2
s  4   3  3

In the same way, the last column becomes:

       c  a  t
   0   1  2  3
c  1   0  1  2
o  2   1  1  2
t  3   2  2  1
s  4   3  3  2

Now, the Levenshtein distance is the value in the lower right corner, 2. Remarkable. This is an interesting algorithm and one which can be implemented relatively easily.
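A minimal JavaScript implementation of the matrix technique just described might look like this (a sketch, with no input validation):

```javascript
// Levenshtein distance via the matrix technique described above.
// String a goes across the top, string b goes down the side.
function levenshtein(a, b) {
  let rows = b.length + 1;
  let cols = a.length + 1;
  let m = [];
  for (let i = 0; i < rows; ++i) {
    m[i] = new Array(cols).fill(0);
    m[i][0] = i;  // left border: 0, 1, 2, ..
  }
  for (let j = 0; j < cols; ++j) m[0][j] = j;  // top border

  for (let i = 1; i < rows; ++i) {
    for (let j = 1; j < cols; ++j) {
      let temp = (b[i-1] === a[j-1]) ? 0 : 1;  // 0 if chars match
      m[i][j] = Math.min(m[i-1][j] + 1,        // cell above, plus 1
                         m[i][j-1] + 1,        // cell to left, plus 1
                         m[i-1][j-1] + temp);  // upper left, plus temp
    }
  }
  return m[rows-1][cols-1];  // lower right corner
}

console.log(levenshtein("cat", "cots"));  // 2
```

The function works regardless of which string is longer, because the borders and the min() rule handle both orientations.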

*Five paintings that depict distance in some way. The center one is the famous “Christina’s World” (1948) by Andrew Wyeth.*