Machine Learning with Natural Language

Natural language processing (NLP) is an important area of machine learning (ML). The Hello World problem for NLP is to take a body of text, such as a paragraph or an entire book, and then create a model that, when given a word from the text, predicts the next word.

Ordinary ML techniques can’t handle such a problem because the next word in some text doesn’t depend on just the previous word; it depends on many previous words. For example, if I asked you what word follows “brown”, you’d have to take a wild guess, but if I told you the previous words were “the” and “quick”, you’d probably guess the next word is “fox”.

NLP is quite difficult. The first step is to encode the source text because ML systems only understand numbers. One common way (but by no means the only way, or the best way) is to use “one-hot” encoding (also called “1-of-N” encoding). Suppose you have just 10 words in your source text: “There must be some kind of way out of here.” Then there are 9 distinct words (“of” is repeated). The nine words could be encoded as “There” = (1,0,0,0,0,0,0,0,0), “must” = (0,1,0,0,0,0,0,0,0), “be” = (0,0,1,0,0,0,0,0,0), . . . “here” = (0,0,0,0,0,0,0,0,1).
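
Here’s a minimal Python sketch of the idea, where I assume words are assigned indexes in order of first appearance (any consistent assignment would work):

sentence = ["There", "must", "be", "some", "kind", "of", "way", "out", "of", "here"]

vocab = []
for w in sentence:
  if w not in vocab:
    vocab.append(w)   # 9 distinct words, in order of first appearance

def one_hot(word):
  vec = [0] * len(vocab)       # all 0s . . .
  vec[vocab.index(word)] = 1   # . . . except a single 1
  return vec

print(one_hot("There"))  # [1, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("here"))   # [0, 0, 0, 0, 0, 0, 0, 0, 1]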

So, when doing NLP, you have to spend a lot of time massaging the source text data. I took a few lines from the James Bond novel “Dr. No” and wrote a utility program that created a text file suitable for use by the CNTK ML code library. The source text is:

Bond watched the big green turtle-backed island grow on the horizon
and the water below him turn from the dark blue of the Cuba Deep to
the azure and milk of the inshore shoals . Then they were over the
North Shore , over its rash of millionaire hotels , and crossing
the high mountains of the interior . The scattered dice of
small-holdings showed on the slopes and in clearings in the jungle ,
and the setting sun flashed gold on the bright worms of tumbling
rivers and streams .

The output for the first three pairs of words is:

|prev 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 |next 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

|prev 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 |next 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

|prev 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 |next 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0

. . .

The utility program scanned the source text and counted the number of unique words, which determines the length of each one-hot vector. Then the source text was scanned again, and each unique word was inserted into a Dictionary object. For example, word_dict["Bond"] = 39 and word_dict["the"] = 2. The utility also created a reverse dictionary, for example, indx_dict[39] = "Bond".
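
The full utility isn’t shown here, but a simplified Python sketch of the key steps might look like the code below. The file names are hypothetical, and this sketch assigns indexes in order of first appearance, which isn’t necessarily the numbering the actual utility produced (e.g., word_dict["Bond"] = 39).

# make_cntk_file.py -- simplified sketch

def make_dicts(words):
  word_dict = {}   # word -> index
  indx_dict = {}   # index -> word
  for w in words:
    if w not in word_dict:
      idx = len(word_dict)
      word_dict[w] = idx
      indx_dict[idx] = w
  return word_dict, indx_dict

def one_hot_str(idx, n):
  vals = ["0"] * n
  vals[idx] = "1"          # single 1 at position idx
  return " ".join(vals)

def main():
  words = open("dr_no_snippet.txt", "r").read().split()  # hypothetical input file
  word_dict, indx_dict = make_dicts(words)
  n = len(word_dict)       # length of each one-hot vector
  with open("dr_no_cntk.txt", "w") as f:   # CNTK-format output
    for i in range(len(words) - 1):
      prev = one_hot_str(word_dict[words[i]], n)
      nxt = one_hot_str(word_dict[words[i+1]], n)
      f.write("|prev " + prev + " |next " + nxt + "\n")

if __name__ == "__main__":
  main()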

NLP can get very, very complicated, but if you’re new to NLP, you just have to learn one step at a time.


“Hot 9” – Jackson Pollock

Posted in CNTK, Machine Learning

Classification using k-NN Explained in 180 Seconds

I was hosting a technical talk recently and had a few minutes between sessions. So I challenged myself to give a blitz talk, in under three minutes, to explain k-NN classification.

I started by saying that k-NN is one of many classification algorithms, and arguably the simplest, but one that a surprising number of people don’t fully understand.

Next, I pulled up a graph to explain how the algorithm works. In the graph there were 33 data points that were one of three colors (red, yellow, green) representing three classes to predict based on two predictor variables (the x0 and x1 coordinates in the graph). The graph also had a single blue dot at (x0 = 5.25, x1 = 1.75) as an unknown to classify.

In k-NN you pick k — suppose it’s k = 4. Then you find the 4 nearest neighbor points to the unknown point, and then you use some sort of voting mechanism (usually majority rule) to predict the class. In the diagram, the blue dot was closest to one red dot, two yellow dots, and one green dot, so the prediction is class “yellow”.
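
Here’s a minimal Python sketch of the algorithm. The training points below are made up for illustration (not the actual 33 points from the graph); only the blue unknown at (5.25, 1.75) and k = 4 match the talk.

# knn_sketch.py -- k-NN classification with majority-rule voting
import math
from collections import Counter

def knn_predict(train_xy, train_labels, unknown, k):
  dists = []
  for i, (x0, x1) in enumerate(train_xy):
    d = math.sqrt((x0 - unknown[0])**2 + (x1 - unknown[1])**2)  # Euclidean distance
    dists.append((d, train_labels[i]))
  dists.sort()                                  # nearest first
  nearest = [lbl for (_, lbl) in dists[:k]]     # labels of the k nearest neighbors
  return Counter(nearest).most_common(1)[0][0]  # majority-rule vote

train_xy = [(5.0, 1.5), (5.5, 2.0), (5.3, 1.4), (4.9, 2.1), (7.0, 3.0)]
train_labels = ["red", "yellow", "yellow", "green", "red"]
print(knn_predict(train_xy, train_labels, (5.25, 1.75), k=4))  # "yellow"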

I finished my mini-micro-talk by pointing out a few pros and cons of k-NN classification. Pros: very simple, can easily deal with any number of possible classes, can handle very bizarre data patterns, there’s only one parameter to tune (the value of k), and results are somewhat interpretable. Cons: works well only when all predictor variables are numeric (because you must compute distance), ties can easily occur, and it doesn’t scale well to huge training datasets.

I asked someone to time me, and I finished the talk in 2 minutes and 37 seconds. It was a fun challenge for me.

Posted in Machine Learning

NFL 2017 Week 7 Predictions – Zoltar Likes the Underdogs Again

Zoltar is my NFL football machine learning prediction system. Here are Zoltar’s predictions for week #7 of the 2017 NFL season:

Zoltar:      chiefs  by    1  dog =     raiders    Vegas:      chiefs  by  2.5
Zoltar:       bills  by    4  dog =  buccaneers    Vegas:       bills  by    3
Zoltar:    steelers  by    8  dog =     bengals    Vegas:    steelers  by    6
Zoltar:      titans  by    7  dog =      browns    Vegas:      titans  by  6.5
Zoltar:    panthers  by    0  dog =       bears    Vegas:    panthers  by  3.5
Zoltar:    dolphins  by    6  dog =        jets    Vegas:    dolphins  by    3
Zoltar:   cardinals  by    0  dog =        rams    Vegas:        rams  by  3.5
Zoltar:     vikings  by    6  dog =      ravens    Vegas:     vikings  by  4.5
Zoltar:     packers  by    6  dog =      saints    Vegas:      saints  by    4
Zoltar:       colts  by    6  dog =     jaguars    Vegas:     jaguars  by    3
Zoltar:     cowboys  by   11  dog = fortyniners    Vegas:     cowboys  by    6
Zoltar:    seahawks  by    0  dog =      giants    Vegas:    seahawks  by    6
Zoltar:     broncos  by    0  dog =    chargers    Vegas:     broncos  by    2
Zoltar:    patriots  by    6  dog =     falcons    Vegas:    patriots  by  3.5
Zoltar:      eagles  by    6  dog =    redskins    Vegas:      eagles  by    5

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. For week #7 Zoltar has five hypothetical suggestions. As in recent weeks, most (four) of the five are on underdogs, and just one suggestion is for a favorite. It seems like Zoltar-2017 has a bias for underdogs, but his accuracy is good, so I’m not sure if Zoltar has the bias or if Vegas has a bias towards favorites.

1. Zoltar likes the Vegas underdog Bears against the Panthers. Vegas believes the Panthers are 3.5 points better than the Bears, but Zoltar thinks the two teams are evenly matched. A bet on the Bears would pay you if the Bears win (by any score), or if the Panthers win but by 3 points or fewer.

2. Zoltar likes the Vegas underdog Packers against the Saints. Vegas has the Saints as 4.0 point favorites but Zoltar thinks the Packers are 6 points better than the Saints. This difference must be due to the injury to the Packers quarterback. Often, Vegas overreacts to a key injury, but in this case the pessimism is probably justified.

3. Zoltar likes the Vegas underdog Colts against the Jaguars. Vegas believes the Jaguars are 3.0 points better than the Colts, but Zoltar thinks the Colts are 6 points better than the Jaguars. This big difference in opinion is rather unusual.

4. Zoltar likes the Vegas favorite Cowboys against the 49ers. Vegas has the Cowboys as 6.0 points better than the 49ers but Zoltar thinks the Cowboys are 11 points better. So, a bet on the Cowboys will only pay you if the Cowboys win by 7 points or more. (If the Cowboys win by exactly 6 points, the bet is called off).

5. Zoltar likes the Vegas underdog Giants against the Seahawks. Vegas likes the Seahawks as 6.0 point favorites, but Zoltar believes that the two teams are evenly matched.

Note: There’s some weirdness with the Rams vs. Cardinals game — my output says the Vegas line is a pick ’em but my data file says Vegas favors the Rams by 3.5 points. I’ll have to walk through Zoltar’s code to find the problem — almost certainly related to the fact that the game is being played at a neutral site (London).

==

Week #6 was very unusual. Against the Vegas spread, which is what Zoltar is designed to do, Zoltar went a nice 4-2. Zoltar correctly liked Vegas underdogs Dolphins, Jets, Giants, and Vikings, but incorrectly liked Vegas underdog Lions and Vegas favorite Raiders.

For the 2017 season so far, against the Vegas point spread, Zoltar is a pretty good 17-8 (68% accuracy). If you must bet $110 to win $100 (typical in Vegas) then you must theoretically predict with 53% or better accuracy to make money, but realistically you must predict at 60% or better accuracy.
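
As a quick sanity check of that 53% figure: if you risk $110 to win $100, then the break-even win fraction p must satisfy 100 * p - 110 * (1 - p) = 0, which gives p = 110 / 210 = 0.524, or about 52.4%, so 53% is the rounded-up break-even point.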

Just for fun, I also track how well Zoltar does when only predicting which team will win. This isn’t really useful except for parlay betting. For week #6, Zoltar was a hideously bad 4-10 just predicting winners.

For comparison purposes, I also track how well Bing and the Vegas line do when just predicting who will win. In week #6, Bing was slightly better than Zoltar (but still weak) at 5-9, and Vegas was also bad at 5-9 just predicting winners.

For the 2017 season so far, just predicting the winning team, Zoltar is 54-37 (59% accuracy), Bing is 50-41 (55% accuracy), and Vegas is 47-42 (53% accuracy). The Vegas accuracy number is the lowest I’ve seen in the last 10 years suggesting that most NFL teams are just about equal in strength.


Original Zoltar from the 1988 movie “Big”

Posted in Machine Learning, Zoltar

Getting MNIST Data into a Text File

The MNIST image data set is used as the “Hello World” example for image recognition in machine learning. The dataset has 60,000 training images to create a prediction system and 10,000 test images to evaluate the accuracy of the prediction model.

Each grayscale image represents a single, hand-drawn digit from 0 to 9. Each image is 28 x 28 pixels, where each pixel value is between 0 (pure white) and 255 (pure black).

There are a total of four files located at http://yann.lecun.com/exdb/mnist/. The first file has the 60,000 training labels (0 to 9). The second file has the corresponding pixel values for each training image (28 * 28 = 784 values per image). The third file has the 10,000 test labels (0 to 9). The fourth file has the corresponding pixel values for each test image.

The raw data files are stored zipped and in a proprietary, binary format. In order to use MNIST data, you must convert the binary data into text data. I’ve seen many utility programs to do the conversion, usually written in Python, and most are incomprehensible. I set out to write the simplest conversion utility possible.

The first decision is to choose a format for the resulting text file. I arbitrarily decided that I wanted the result to look like this:

digit 7
pixls 0 0 25 253 . . .

digit 2
pixls 0 0 127 84 . . .

digit 9
pixls 0 0 0 172 . . .

etc.

The next step is to download the four zipped data files from the URL above, and then unzip them. The files are in .gz format, which Windows can’t handle natively, so I used the 7-Zip utility. After installing it, I right-clicked on each .gz file, selected “7-Zip” and then “Extract files”, and unzipped to a directory I named Unzipped. To keep things clear, I added a “.bin” extension to each unzipped file so I could remember they’re in binary format.

The next step is to write the utility function. Here’s the code:

# converter_mnist.py

def convert(img_file, label_file, txt_file, n_images):
  lbl_f = open(label_file, "rb")   # MNIST has labels (digits)
  img_f = open(img_file, "rb")     # and pixel vals separate
  txt_f = open(txt_file, "w")      # output file to write to

  img_f.read(16)   # discard header info
  lbl_f.read(8)    # discard header info

  for i in range(n_images):   # number images requested 
    lbl = ord(lbl_f.read(1))  # read one byte, convert to int label
    txt_f.write("digit " + str(lbl) + "\n")
    txt_f.write("pixls ")
    for j in range(784):  # get 784 vals from the image file
      val = ord(img_f.read(1))
      txt_f.write(str(val) + " ")  # will leave a trailing space 
    txt_f.write("\n")  # next image

  img_f.close(); txt_f.close(); lbl_f.close()

def main():
  convert(".\\Unzipped\\t10k-images.idx3-ubyte.bin",
          ".\\Unzipped\\t10k-labels.idx1-ubyte.bin",
          "mnist_test.txt", 3)

if __name__ == "__main__":
  main()

I made the code as simple as I could. If you are new to machine learning, the ability to work with MNIST data is important. And you need code like this to get the raw, zipped, binary MNIST data into a usable format.

Posted in Machine Learning

The Radial Basis Function Kernel

The radial basis function (RBF) kernel is . . . Well, let me back up a moment. When I want to know what a machine learning concept is, I want to know four things. First, what it is in a single sentence. Second, the formal or math definition. Third, a concrete example of how to calculate or compute it. Fourth, what it is used for.

So,

1. An RBF kernel is a measure of similarity between two numeric vectors.

2. The math definition is: K(x, x') = exp( -||x - x'||^2 / (2 * sigma^2) ), where ||x - x'||^2 is the squared Euclidean distance between the two vectors and sigma is a free parameter that controls how quickly similarity falls off with distance.

3. An example calculation: if x = (1, 2), x' = (3, 4), and sigma = 1.0, then ||x - x'||^2 = (1 - 3)^2 + (2 - 4)^2 = 8, and so K(x, x') = exp( -8 / 2 ) = exp(-4.0) = 0.0183. A short Python check appears after this list.

4. RBF kernel functions are used in many areas of ML, including support vector machine (SVM) classification, RBF networks, and several so-called “kernel methods”.
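
Here is that example calculation as a minimal Python sketch (NumPy only; the vectors and sigma = 1.0 are just the example values above):

import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
  d2 = np.sum((x - xp)**2)                  # squared Euclidean distance
  return np.exp(-d2 / (2 * sigma * sigma))  # 1.0 if identical, near 0.0 if far apart

x = np.array([1.0, 2.0])
xp = np.array([3.0, 4.0])
print(rbf_kernel(x, xp))  # 0.0183...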

I estimate there are roughly 100 key machine learning concepts. To be sure, there are many more than 100 important ML concepts, but I figure about 100 of them are absolutely essential knowledge.

For me, learning usually occurs from specific to general. I learn very specific techniques, and over time, come to understand higher level concepts, theory, and relationships.

From my days as a university professor, I noticed a clear distinction between students who learn specific-to-general like I mostly do, and those who learn general-to-specific. Whenever I teach something to a group of people, I try to keep this idea in mind and present things in both ways as much as possible.


“Rialto Bridge” – Canaletto, 1746. Multiple radial structures.

Posted in Machine Learning

Cross Entropy Error – General Case vs. Neural Classifier Case

Beginners to machine learning are sometimes confused by cross entropy error. Cross entropy error is also called log loss. In the general case, cross entropy error is a measure of error between a set of predicted probabilities and a set of actual probabilities.

Cross entropy error is calculated as “the negative of the sum of the log of the predicteds times the associated actuals,” where log is the natural log. For example, if a set of predicted probabilities is (0.20, 0.70, 0.10) and the associated actual probabilities are (0.25, 0.45, 0.30), then the CE error is:

CE = - [ log(0.20)*0.25 + log(0.70)*0.45 + log(0.10)*0.30 ]
   = 1.254

But in neural network classification, the actual probabilities are the one-hot encoded target class label, which has a single 1-value and the rest 0-values. So if the predicted probabilities are as before, (0.20, 0.70, 0.10), and the class labels are (0, 1, 0), then the CE is:

CE = - [ log(0.20)*0 + log(0.70)*1 + log(0.10)*0 ]
   = - [ 0 + (-0.357) + 0 ]
   = 0.357

Because all of the actual probabilities except one are 0, all but one term drops out. At first this just doesn’t seem correct, but it is.

And in the case of binary classification, the CE equation can be simplified further. If the target probabilities are (y, 1-y) and the predicted probabilities are (y', 1-y'), then CE = - [ log(y')*y + log(1-y')*(1-y) ]. When the target is (1, 0), so y = 1, this reduces to just -log(y'), which for a prediction of (0.70, 0.30) gives 0.357 as before.
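
Here’s a short Python check of these calculations (using the natural log, which is what gives the 1.254 and 0.357 values above):

import math

def cross_entropy(predicted, actual):
  # negative sum of log(predicted) times the associated actual
  return -sum(a * math.log(p) for (p, a) in zip(predicted, actual))

print(cross_entropy([0.20, 0.70, 0.10], [0.25, 0.45, 0.30]))  # 1.2537 (general case)
print(cross_entropy([0.20, 0.70, 0.10], [0, 1, 0]))           # 0.3567 (one-hot labels)
print(-math.log(0.70))                                        # 0.3567 (binary shortcut)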

Like many things on the road to machine learning mastery, this is something that seems surprising at first but quickly becomes wired-in knowledge.

Posted in Machine Learning

The Beta Distribution in Machine Learning

The beta distribution appears in several machine learning topics. Like many math distributions, the beta distribution is both simple (to use) and complex (to fully understand).

The beta distribution is best explained by starting with an example. I’ll use Python because, annoyingly, there’s no built-in beta function for C#.

import numpy as np             # np.random.beta() is in here
np.random.seed(1)              # make reproducible
p1 = np.random.beta(a=1, b=1)  # probability 1
p2 = np.random.beta(a=1, b=1)  # probability 2
p3 = np.random.beta(a=1, b=1)  # probability 3

This code will return three random probability values. Each will be between 0.0 and 1.0 and be uniformly distributed with an average of 0.5.

The a and b parameters, often called alpha and beta in the math literature (which is absolutely terrible because now “beta” has two meanings: the distribution and the parameter), define how the distribution behaves. It’s similar to the way the normal (Gaussian, bell-shaped) distribution has two parameters, mean and standard deviation, that define what kind of values you get.

For the beta distribution, you always get a probability value between 0.0 and 1.0 where the average probability returned is a / (a + b). When a = 1 and b = 1, the average return value is 1 / (1 + 1) = 0.5 which is a uniform distribution.

Suppose a = 3 and b = 1. The average returned probability will be 3 / (3 + 1) = 0.75, so most returned values will be greater than 0.75, even though any value between 0.0 and 1.0 is possible. Here’s a graph of pulling 10,000 samples from Beta(a=3, b=1).

So, that’s pretty easy. But why would the beta distribution ever be useful? This is much harder to explain and would take several pages so briefly . . .

Suppose you are observing some random process that emits a series of “success” or “failure” results over time. You start with no knowledge, so you assume P(success) = P(failure) = 0.5. But then you observe: t = 1 success, t = 2 success, t = 3 failure, t = 4 success, t = 5 failure. What is likely to come next at t = 6?

Using beta, initially a = b = 1. You observed 3 successes and 2 failures, so set a = 1 + 3 = 4 and b = 1 + 2 = 3. The expected probability of success at t = 6 is a / (a + b) = 4 / (4 + 3) = 4/7 = 0.5714, and you could sample possible outcomes using the beta distribution.
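
Here’s a minimal Python sketch of that update (the seed value and the number of samples are arbitrary):

import numpy as np

np.random.seed(0)            # make reproducible

a, b = 1, 1                  # no knowledge: Beta(1,1) is uniform, average 0.5
successes, failures = 3, 2   # observed at t = 1 through t = 5
a += successes               # a = 4
b += failures                # b = 3

print(a / (a + b))           # 0.5714 = expected probability of success at t = 6

samples = np.random.beta(a, b, size=10000)  # possible values of the success probability
print(samples.mean())        # approximately 0.5714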

This should give you a hint of what the beta distribution is. For a complete explanation, the Wikipedia entry on the topic is very thorough.


Simulation of beta particle decay in Physics

Posted in Machine Learning