Recap of the 2019 TDWI Conference

I spoke at the 2019 TDWI Conference. The event ran from February 10-15 in Las Vegas. I estimate there were about 500 people at the conference. Like most technical conferences, it had standard speaking sessions, workshops and training classes, and an exhibit hall.

I gave the keynote talk for the event, titled “The Present and Future of Machine Learning and Artificial Intelligence”. For “the present”, I described deep neural networks, LSTM networks, CNN networks, and so on. For “the future”, I talked about GANs, homomorphic encryption, quantum computing, and other emerging topics.

I think the one slide I got most excited about was the one where I described AlphaZero, the deep reinforcement learning chess program, and its stunning match result against the reigning world champion program, Stockfish: 28 wins, no losses, and the remaining games drawn. This achievement shows the incredible potential of ML.

Most of the attendees I talked to were data scientists or business analysts at medium and large size companies, such as banks, insurance companies, energy companies, and state and federal government. But there were many attendees from small companies, and from all kinds of backgrounds too.

Many big tech companies were represented at the 2019 TDWI event including Google, IBM, Oracle, SAP, SAS, and others. The event Expo was nice even though it was relatively small. There were about 40 companies there. I especially enjoyed talking to the representatives from a Seattle-based company named Algorithmia.

All things considered, the 2019 TDWI Conference was a very good use of my time. I learned a lot, both technically and from a business perspective, and I’m confident I was able to educate attendees about Microsoft machine learning technologies. And I returned to my work with renewed enthusiasm and new ideas.



The 2019 TDWI Expo was relatively small but had a lot of interesting companies. I enjoyed talking to the representatives from the companies because I gained useful insights into the business side of ML/AI that will help me do my job better.

Posted in Conferences

Quick and Easy Naive Bayes Classification

Somewhat unusually, I can get a lot of technical work done while I travel. When I’m sitting in an airport terminal or on a plane, there are no distractions, and I can do what I call mental coding.

On a recent trip, for some reason, the topic of Naive Bayes classification popped into my head. I spent the next couple of hours mentally coding up a minimal implementation, with a little bit of help from an airline paper napkin and a pen.

When I got back home, I opened up my laptop and coded my mental implementation using Python. There were several glitches of course, but I had all the main ideas correct.

For my demo, I created a 40-item dummy data file that looks like:

'A', 'R', 'T', '1'
'C', 'R', 'S', '0'
'Z', 'L', 'M', '0'
'Z', 'R', 'M', '1'
. . .

There are three predictor values followed by a 0 or a 1. The goal is to classify data as 0 or 1. The first variable can be one of (A, C, E, Z). The second variable can be one of (L, R). The third variable can be one of (S, M, T). There are versions of Naive Bayes that work with numeric predictor data, but the simplest form works with categorical predictor values. My demo is binary classification, but Naive Bayes easily extends to multiclass classification.

An implementation of Naive Bayes is relatively short but conceptually quite deep. A full explanation would take several pages. But briefly, joint counts (such as the count of items with both 'E' and class '0') are computed, along with counts of the dependent values (0 and 1), and the counts are combined according to Bayes' theorem to yield probabilities.

One important implementation decision is the tradeoff between a specific implementation for a given problem and a completely general implementation that can be applied to most problems. I opted, as I usually do, for a mostly specific, non-general implementation.

In my demo, I classified the input (E, R, T). The result is a pair of values, (0.3855, 0.6145), which loosely represent the probabilities that the input is class 0 or class 1. Because the second value is larger, the prediction is class 1.
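Here's a minimal Python sketch of the counting-and-combining approach I just described. It's not my exact demo code; the file name, the helper function, and the add-one (Laplacian) smoothing are illustrative assumptions:

# minimal categorical Naive Bayes sketch (illustrative, not the exact demo code)
def load_items(fn):
  items = []  # each item: (x0, x1, x2, class) like ('A', 'R', 'T', 1)
  with open(fn, "r") as f:
    for line in f:
      toks = [t.strip().strip("'") for t in line.strip().split(",")]
      if len(toks) == 4:
        items.append((toks[0], toks[1], toks[2], int(toks[3])))
  return items

def predict(items, x):  # x is a tuple of predictor values, like ('E', 'R', 'T')
  n_classes = 2
  class_counts = [0] * n_classes
  # joint_counts[j][c] = number of items whose j-th predictor equals x[j] and whose class is c
  joint_counts = [[0] * n_classes for _ in range(len(x))]
  for it in items:
    c = it[-1]
    class_counts[c] += 1
    for j in range(len(x)):
      if it[j] == x[j]:
        joint_counts[j][c] += 1
  evidence = [0.0] * n_classes
  for c in range(n_classes):
    term = class_counts[c] / len(items)  # estimate of P(class = c)
    for j in range(len(x)):
      # estimate of P(x_j | class = c): +1 to joint counts, +num-predictors to class counts
      term *= (joint_counts[j][c] + 1) / (class_counts[c] + len(x))
    evidence[c] = term
  total = sum(evidence)
  return [e / total for e in evidence]  # pseudo-probabilities for class 0 and class 1

probs = predict(load_items("dummy_40.txt"), ('E', 'R', 'T'))
print(probs)  # the larger of the two values gives the predicted class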



I have always been fascinated by model train buildings. Some are truly works of art. Most of the people I’m comfortable hanging out with enjoy modeling reality in some way (machine learning, art, games, etc.)

Posted in Machine Learning

Reading a Text File of Numbers into a JavaScript Matrix Using Node.js

I’ve been looking at the idea of creating a neural network using plain JavaScript running in the Node.js system. A basic utility task is to read a text file of training data into memory as a numeric matrix.

For my demo, I created a small text file that has five lines of data from the well-known Iris Dataset:

5.5, 2.5, 4.0, 1.3, 0, 1, 0
6.3, 3.3, 6.0, 2.5, 0, 0, 1
5.8, 2.7, 5.1, 1.9, 0, 0, 1
7.1, 3.0, 5.9, 2.1, 0, 0, 1
6.3, 2.9, 5.6, 1.8, 0, 0, 1

Before starting, I coded up helper functions to create and to print a two-dimensional numeric matrix (matrixMake and matrixPrint, used below).

There are many ways to read data from a file using the Node.js system. My approach reads the entire file contents into memory, splits on “\n” into an array of strings, then parses out each line:

let fs = require('fs');
let all = fs.readFileSync('iris_five.txt', "utf8");
all = all.trim();  // final crlf in file
let lines = all.split("\n");
let n = lines.length;
let m = matrixMake(n, 7, 0.0);  // numeric

for (let i = 0; i < n; ++i) {  // each line
  let tokens = lines[i].split(",");
  for (let j = 0; j < 7; ++j) {  // each val curr line
    m[i][j] = parseFloat(tokens[j]);
  }
}
matrixPrint(m, 1);  // 1 decimal

This approach isn’t very robust, and won’t work for huge files, but it’s simple and effective for basic neural network purposes. Here’s a version of the code that’s been refactored into a function that resembles the Python NumPy loadtxt() function:

let fs = require('fs');
function loadTxt(fn, delimit, usecols) {
  let all = fs.readFileSync(fn, "utf8");  // giant string
  all = all.trim();  // strip final crlf in file
  let lines = all.split("\n");
  let rows = lines.length;
  let cols = usecols.length;
  let result = matrixMake(rows, cols, 0.0); 
  for (let i = 0; i < rows; ++i) {  // each line
    let tokens = lines[i].split(delimit);
    for (let j = 0; j < cols; ++j) {
      result[i][j] = parseFloat(tokens[usecols[j]]);
    }
  }
  return result;
}

let data_x = loadTxt(".\\iris_train.txt", ",", [0,1,2,3]);
let data_y = loadTxt(".\\iris_train.txt", ",", [4,5,6]);


From the movie “The Matrix” (1999): the white rabbit girl, the mysterious woman in the red dress, Trinity, Persephone.

Posted in Machine Learning

I Give a Talk About Anomaly Detection Using a Neural Autoencoder with PyTorch

Anomaly detection is a very difficult problem. I’ve been experimenting with a technique that I couldn’t find any research or practical information about. Briefly: to find anomalous data, create a neural autoencoder, then analyze the reconstruction error of each data item; the items with the highest error are (maybe) the most anomalous.

I normally wouldn’t give a talk on a topic where I don’t fully understand all the details. But, I’m working with a team in my large tech company, and if my autoencoder reconstruction idea is valid, the technique will be extremely valuable to them.

As always, when I presented the details, the attendees asked great questions that forced me to think very deeply. (The people at my company are, for the most part, very smart.) This attention to detail is characteristic of the machine learning research I’m doing.

Here’s one of at least a dozen examples (which will only make sense if you understand neural autoencoders). The dataset was MNIST images, so there were 784 input values, where each value is a pixel value between 0 and 255, normalized to between 0.0 and 1.0. My demo autoencoder had a 784-100-50-100-784 architecture. The hidden layers used tanh activation, and I applied tanh activation to the output layer too.
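For reference, here's a minimal PyTorch sketch of this kind of autoencoder and the reconstruction-error ranking. The layer sizes match the 784-100-50-100-784 architecture described above, but the training settings and names are just placeholders, not the code from my talk:

import torch as T

class AutoEncoder(T.nn.Module):  # 784-100-50-100-784, tanh everywhere, including output
  def __init__(self):
    super().__init__()
    self.enc1 = T.nn.Linear(784, 100)
    self.enc2 = T.nn.Linear(100, 50)
    self.dec1 = T.nn.Linear(50, 100)
    self.dec2 = T.nn.Linear(100, 784)
  def forward(self, x):
    z = T.tanh(self.enc1(x))
    z = T.tanh(self.enc2(z))
    z = T.tanh(self.dec1(z))
    return T.tanh(self.dec2(z))  # tanh applied to the output layer too

def train(model, data, epochs=100, lr=0.01):
  opt = T.optim.Adam(model.parameters(), lr=lr)
  loss_fn = T.nn.MSELoss()
  for _ in range(epochs):
    opt.zero_grad()
    loss = loss_fn(model(data), data)  # an autoencoder predicts its own inputs
    loss.backward()
    opt.step()

def reconstruction_errors(model, data):  # data: [n, 784] tensor of normalized pixel values
  model.eval()
  with T.no_grad():
    recon = model(data)
  return ((recon - data) ** 2).mean(dim=1)  # per-item mean squared reconstruction error

# usage sketch: the items with the largest errors are the (possible) anomalies
# errs = reconstruction_errors(model, data)
# suspects = T.argsort(errs, descending=True)[:10]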

But the question is, why not sigmoid activation, or ReLU activation, or even no/identity activation on the output layer? The logic is that because the input values are between 0.0 and 1.0, and an autoencoder predicts its inputs, you surely want the output values to be confined to 0.0 to 1.0 which can be accomplished using sigmoid activation. Why did I use tanh output activation?

Well, the answer is long, so I won’t try to give it here. My real point is that this was just one of many details of the autoencoder reconstruction error technique for anomaly detection. And on top of all the conceptual ideas, I used the PyTorch neural network library, so there were many language and engineering issues to consider too.

Anyway, I thought I did a good job on my talk, and I got as much value from delivering it as the attendees who listened to it did.



Artist Mort Kunstler (b. 1931) created many memorable paintings that were used for the covers of men’s adventure magazines in the 1960s. I’m not really sure if the works are supposed to be satire or not. Kunstler’s paintings have an extremely high level of detail.

Posted in Machine Learning, PyTorch

Determining Weights for Weighted k-NN Classification

The k-nearest neighbors classification algorithm is one of the oldest and simplest machine learning techniques. I was exploring the technique recently and was mildly surprised to find very little practical information about how to generate k-NN voting weights.

Suppose k is set to 5. The first part of k-NN calculates the distances from the item-to-be-classified to all the labeled data, then finds the k = 5 nearest/closest labeled data vectors.

For example, suppose the classification problem has 3 classes, and the 5 nearest data vectors are:

data                      dist  class
=====================================
(0.45, 0.32, 0.78, 0.11)  0.1234  1
(0.65, 0.33, 0.92, 0.54)  0.2076  0
(0.52, 0.47, 0.82, 0.55)  0.3588  0
(0.41, 0.78, 0.43, 0.58)  0.5032  1
(0.73, 0.29, 0.44, 0.61)  0.7505  2

To which class (0, 1, or 2) does the item-to-be-classified belong? Class 1 is the closest, but class 0 is the second and third closest.

The simplest approach is to use uniform voting weights. For k = 5 each weight is 1/5 = 0.20, so the votes are:

Class 0: 2 * 0.20 = 0.40
Class 1: 2 * 0.20 = 0.40
Class 2: 1 * 0.20 = 0.20

This results in a tie between class 0 and class 1. The uniform voting weights approach is equivalent to a simple majority vote.

There are several ways to create weights that give more importance to closer data values. The inverse weights approach computes the inverse of each distance, sums the inverses, then divides each inverse by the sum of the inverses. For the data above the weights are:

dist     inverse  wts = inv/sum
===============================
0.1234   8.1037    0.4259  
0.2076   4.8170    0.2532
0.3588   2.7871    0.1465
0.5032   1.9873    0.1044
0.7505   1.3324    0.0700

        19.0275    1.0000

The voting is:

Class 0: 0.2532 + 0.1465 = 0.3997
Class 1: 0.4259 + 0.1044 = 0.5303
Class 2: 0.0700          = 0.0700

And so the conclusion is that the item-to-be-classified is class 1, the class with the largest weighted vote. (With uniform weights there was a tie between class 0 and class 1; the inverse weights break the tie in favor of class 1 because the closest neighbor is class 1.)
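Here's a short Python sketch of the inverse-weights voting, using the distances and class labels from the example (the variable names are mine):

# inverse-distance voting for weighted k-NN, using the example values above
dists = [0.1234, 0.2076, 0.3588, 0.5032, 0.7505]  # distances to the k=5 nearest items
labels = [1, 0, 0, 1, 2]                          # classes of those items
n_classes = 3

inverses = [1.0 / d for d in dists]               # assumes no distance is exactly zero
s = sum(inverses)                                 # 19.0275
wts = [inv / s for inv in inverses]               # [0.4259, 0.2532, 0.1465, 0.1044, 0.0700]

votes = [0.0] * n_classes
for w, lbl in zip(wts, labels):
  votes[lbl] += w
print(votes)                                      # [0.3997, 0.5303, 0.0700]
print(votes.index(max(votes)))                    # predicted class = 1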

I usually use the inverse weights approach but there are other weighting techniques for k-NN. One alternative is to use the ranks of the distances and compute rank order centroids. This weights closer labeled data more heavily than the inverse technique. Another approach is to sum the distances, take each distance divided by the sum, then reverse the order of the weights. This penalizes big distances more heavily than the inverse technique.
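And here's a sketch of those two alternatives for the same five distances. The rank order centroid weights use the standard formula; the second scheme follows the sum-divide-reverse description above:

# alternative k-NN weighting schemes, applied to the same example distances
dists = [0.1234, 0.2076, 0.3588, 0.5032, 0.7505]  # sorted, closest first
k = len(dists)

# rank order centroid: weight for rank i is (1/k) * sum of 1/j for j = i..k
roc = [sum(1.0 / j for j in range(i, k + 1)) / k for i in range(1, k + 1)]
print(roc)   # [0.4567, 0.2567, 0.1567, 0.0900, 0.0400]

# reversed proportions: each distance divided by the sum, then the weight order reversed
props = [d / sum(dists) for d in dists]
rev = list(reversed(props))
print(rev)   # [0.3862, 0.2589, 0.1846, 0.1068, 0.0635] -- closest item gets the largest weight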



I find high fashion quite interesting. I couldn’t even begin to assign quality weights to the dresses at a fashion show.

Posted in Machine Learning

Converting Non-Numeric or Mixed Data to Strictly Numeric Data

Like many topics in machine learning, this idea is a bit tricky to explain so bear with me. My original problem was data clustering. Every standard clustering technique, in particular k-means, requires the source data to be completely numeric because you must compute a distance value between different data items (usually using Euclidean distance).

But what if your data contains some non-numeric values? For example, imagine some fake flower data:

blue  5.1  3.5  1.4  0.2
pink  4.9  3.0  1.4  0.4
teal  4.7  3.2  1.3  0.3
etc.

The first value is the color of the flower then the next four values are sepal length, sepal width, petal length, petal width. How do you deal with the color variable if you want to cluster the dataset? If you’re not familiar with clustering you’d think this would be easy, but trust me, it’s not.

My idea is to convert mixed data into strictly numeric data by using a neural autoencoder. I ran a little experiment where I first encoded the color as blue = (1, 0), pink = (0, 1), teal = (-1, -1) then I created a 6-10-8-10-6 autoencoder that accepts the six input values and predicts those same values. After training, the 8 nodes in the central hidden layer are a strictly numeric representation of each flower. . .

At least in theory. My scheme is somewhat related to and based on word embeddings where words are converted to numeric vectors in a roughly similar way.
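To make the idea a bit more concrete, here's a minimal PyTorch sketch of the 6-10-8-10-6 scheme. The color encoding and layer sizes come from the description above; the tanh activations and everything else are just my illustrative choices:

import torch as T

# encode the color as two values, as described above
color_map = {"blue": (1.0, 0.0), "pink": (0.0, 1.0), "teal": (-1.0, -1.0)}
raw = [("blue", 5.1, 3.5, 1.4, 0.2),
       ("pink", 4.9, 3.0, 1.4, 0.4),
       ("teal", 4.7, 3.2, 1.3, 0.3)]
data = T.tensor([list(color_map[c]) + [sl, sw, pl, pw] for (c, sl, sw, pl, pw) in raw])

class MixedAutoEncoder(T.nn.Module):  # 6-10-8-10-6
  def __init__(self):
    super().__init__()
    self.enc1 = T.nn.Linear(6, 10)
    self.enc2 = T.nn.Linear(10, 8)
    self.dec1 = T.nn.Linear(8, 10)
    self.dec2 = T.nn.Linear(10, 6)
  def encode(self, x):                # the strictly numeric 8-value representation
    return T.tanh(self.enc2(T.tanh(self.enc1(x))))
  def forward(self, x):               # the autoencoder predicts its own six inputs
    z = self.encode(x)
    return self.dec2(T.tanh(self.dec1(z)))

# after training with a standard MSE reconstruction loss:
# encodings = model.encode(data)      # shape [n_flowers, 8], ready for k-means clustering

(In practice the four numeric columns would typically be normalized first so that no single column dominates the reconstruction loss.)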

Anyway, there’s a lot going on here and I’ve found almost zero existing research or practical information along these lines. One of the problems when working with clustering is that a good clustering is very hard to define precisely. But I’ll keep probing away at this problem.



“Good” art is impossible to define precisely. Five examples from artists whose work is often described as kitsch/bad. Beauty is in the eye of the beholder but I think all five paintings are wonderful. “The Green Lady”, Vladimir Tretchikoff. “Lamplight Manor”, Thomas Kinkade. “Tina”, JH Lynch. “Dancers in the Koutoubia Palace”, LeRoy Neiman. “Gypsy Girl”, Charles Roka.

Posted in Machine Learning

Rating Competitors Using Infer.NET

I wrote an article titled “Rating Competitors Using Infer.NET” in the February 2019 issue of Microsoft MSDN Magazine. See https://msdn.microsoft.com/en-us/magazine/mt833275.

Infer.NET is a code library created by Microsoft Research. The library contains functions that can be used to solve problems that involve probability. In my article, I demonstrate how to use Infer.NET to infer/calculate the ratings of a set of sports teams based on some win-lose data.

Specifically, I set up six teams: Angels, Bruins, Comets, Demons, Eagles, Flyers. Then I set up nine game results:

Angels beat Bruins
Bruins beat Comets
Comets beat Demons
Angels beat Eagles
Bruins beat Demons
Demons beat Flyers
Angels beat Flyers
Comets beat Eagles
Eagles beat Flyers

Then I supplied some math assumptions: the ratings of the teams are Normal (Gaussian) distributed, and the average rating of a team is 2000 with a standard deviation of 200. These assumptions were somewhat arbitrary. Then the Infer.NET program used its built-in functions to find the set of ratings that best match the data. The results were: Angels = 2256.8, Bruins = 2135.3, Comets = 2045.5, Demons = 1908.0, Eagles = 1914.6, Flyers = 1739.7.

This is an example of what’s called maximum likelihood estimation. Working backwards, starting with these ratings, the observed win-lose data is more likely than it would be if the ratings were different.
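Here's a rough Python sketch of that "working backwards" idea: find ratings that make the observed win-lose results likely. This is not Infer.NET and not the method from the article; it uses a simple logistic win model with plain gradient ascent, and the model, step size, and iteration count are all just illustrative choices:

import math

# conceptual sketch only -- not Infer.NET, and not the model from the article
teams = ["Angels", "Bruins", "Comets", "Demons", "Eagles", "Flyers"]
games = [(0,1), (1,2), (2,3), (0,4), (1,3), (3,5), (0,5), (2,4), (4,5)]  # (winner, loser)

mu, sd = 2000.0, 200.0        # prior assumption: ratings roughly Normal(2000, 200)
ratings = [mu] * len(teams)

def p_win(rw, rl):            # logistic model: higher-rated team is more likely to win
  return 1.0 / (1.0 + math.exp(-(rw - rl) / sd))

lr = 500.0                    # step size chosen for this rating scale
for _ in range(2000):         # simple gradient ascent on log prior + log likelihood
  grads = [(mu - r) / (sd * sd) for r in ratings]    # pull each rating toward the prior mean
  for (w, l) in games:
    g = (1.0 - p_win(ratings[w], ratings[l])) / sd   # reward the winner, penalize the loser
    grads[w] += g
    grads[l] -= g
  ratings = [r + lr * g for r, g in zip(ratings, grads)]

for t, r in sorted(zip(teams, ratings), key=lambda p: -p[1]):
  print(f"{t:8s}{r:9.1f}")    # Angels should come out highest, Flyers lowest

The exact numbers won't match the Infer.NET results, but the ordering of the teams reflects the same win-lose evidence.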

Working with the Infer.NET code library, which was developed by researchers, feels different from working with a code library designed by software engineers. This creates some friction for most developers who want to use Infer.NET. My point here is that the functionality of a code library is important, but the interface is important too.

There are other probabilistic programming code libraries and languages; Stan, Pyro, and Edward are among the most common. The Wikipedia entry on Probabilistic Programming Languages (PPLs) lists 46 of them. None of the machine learning guys I work with use probabilistic programming, which leads me to believe that PPLs are used mostly by researchers.



College majors. Electrical engineering 88% male. Computer Science 82% male. Physical sciences 62% male. Health professions 85% female. Education 80% female. English 80% female. In the workplace, roughly 75% of computer science employees are male and 25% are female. Therefore, there’s a strong statistical argument that women are over, not under, represented in computer science jobs.

Posted in Machine Learning, Miscellaneous