Understanding k-NN Classification using C#

I wrote an article titled “Understanding k-NN Classification using C#” in the December 2017 issue of Microsoft MSDN Magazine. See https://msdn.microsoft.com/en-us/magazine/mt814421.

The goal of k-NN (“k nearest neighbors”) classification is to predict the class of an item based on two or more predictor variables. For example, you might want to predict the political leaning (conservative, moderate, liberal) of a person based on their age, income, years of education, and number of children.

The technique is very simple. You obtain a set of training data that has known input and class values. Then, for an unknown item, you find the k nearest training items and predict the most common class among those k neighbors.

In the image below, there are three classes, indicated by the red, green, and yellow data points. Each item has two predictor variables. The blue dot is the unknown. If you set k = 4, the four closest points to the blue dot are the red at (5,3), the yellow at (4,2), the yellow at (4,1), and the green at (6,1). The most common class is yellow, so you predict yellow.
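
To make the voting idea concrete, here is a minimal C# sketch of k-NN classification. This is not the code from the magazine article — the coordinates of the blue dot and the two extra training points are made up for illustration, and there is no tie-breaking logic.

using System;
using System.Linq;

class KnnDemo
{
  static void Main()
  {
    // training data: (x, y, class) where class 0 = red, 1 = green, 2 = yellow
    double[][] data = new double[][] {
      new double[] { 5, 3, 0 }, new double[] { 6, 1, 1 },
      new double[] { 4, 2, 2 }, new double[] { 4, 1, 2 },
      new double[] { 2, 5, 0 }, new double[] { 8, 4, 1 } };

    double[] unknown = new double[] { 5, 2 };  // the blue dot
    int k = 4;

    // find the k training items closest to the unknown item
    var nearest = data.OrderBy(row => Distance(row, unknown)).Take(k);

    // vote: the most common class among the k neighbors wins
    int predicted = nearest.GroupBy(row => (int)row[2])
      .OrderByDescending(g => g.Count()).First().Key;

    Console.WriteLine("Predicted class = " + predicted);  // 2 (yellow)
  }

  static double Distance(double[] row, double[] unknown)
  {
    // Euclidean distance over the predictor columns
    double sum = 0.0;
    for (int i = 0; i < unknown.Length; ++i)
      sum += (row[i] - unknown[i]) * (row[i] - unknown[i]);
    return Math.Sqrt(sum);
  }
}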

Compared to other classification algorithms, the advantages of k-NN classification include: easy to implement, can be easily modified for specialized scenarios, works well with complex data patterns, and the results are somewhat interpretable.

Disadvantages of k-NN classification include: the result can be sensitive to your choice of the value for k, the technique works well only when all predictor variables are strictly numeric, it’s possible to get a tie result prediction, and the technique doesn’t work well with huge training data sets.

Posted in Machine Learning

NFL 2017 Week 14 Predictions – Zoltar Likes Two Favorites and One Dog

Zoltar is my NFL football machine learning prediction system. It’s a hybrid system that uses a custom reinforcement learning algorithm plus a neural network. Here are Zoltar’s predictions for week #14 of the 2017 NFL season:

Zoltar:     falcons  by    2  dog =      saints    Vegas:      saints  by    1
Zoltar:      texans  by    9  dog = fortyniners    Vegas:      texans  by    3
Zoltar:       bills  by    6  dog =       colts    Vegas:       bills  by    4
Zoltar:     packers  by   12  dog =      browns    Vegas:     packers  by  3.5
Zoltar:     bengals  by    6  dog =       bears    Vegas:     bengals  by  6.5
Zoltar:       lions  by    0  dog =  buccaneers    Vegas:       lions  by    1
Zoltar:     cowboys  by    3  dog =      giants    Vegas:     cowboys  by    5
Zoltar:     vikings  by    0  dog =    panthers    Vegas:     vikings  by    3
Zoltar:      chiefs  by    2  dog =     raiders    Vegas:      chiefs  by    4
Zoltar:     broncos  by    2  dog =        jets    Vegas:        jets  by    1
Zoltar:    chargers  by    4  dog =    redskins    Vegas:    chargers  by    6
Zoltar:      titans  by    1  dog =   cardinals    Vegas:      titans  by    3
Zoltar:      eagles  by    0  dog =        rams    Vegas:        rams  by    2
Zoltar:    seahawks  by    0  dog =     jaguars    Vegas:     jaguars  by    3
Zoltar:    steelers  by    7  dog =      ravens    Vegas:    steelers  by    7
Zoltar:    patriots  by    6  dog =    dolphins    Vegas:    patriots  by   11

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. In previous years, Zoltar would typically have hypothetical suggestions for only two or three games per week. But this season Zoltar and the Las Vegas line have differed greatly, and so Zoltar has been recommending six to eight games per week. The difference this season is the unusually large number of injuries to key players. (Note: I don’t actually bet on games — for me the whole point is the machine learning algorithms. Well, OK, I’ll place a bet every now and then, just for fun).

This week, Zoltar is somewhat back to normal and has just three hypothetical suggestions.

1. Zoltar likes the Vegas favorite Texans over the 49ers. Vegas has the Texans as a moderate favorite by 3.0 points, but Zoltar believes the Texans are a large 9 points better than the 49ers. A bet on the Texans will only pay off if the Texans win by more than 3.0 points (in other words, 4 points or more — if the Texans win by exactly 3 points the game is a push).

2. Zoltar likes the Vegas favorite Packers over the Browns. Vegas has the Packers as a moderate favorite by 3.5 points, but Zoltar believes the Packers are a huge 12 points better than the Browns.

3. Zoltar likes the Vegas underdog Dolphins against the Patriots. Vegas thinks the Patriots are a huge 11.0 points better than the Dolphins, but Zoltar computes that the Patriots are only 6 points better. Note: I have an advanced version of Zoltar, and advanced Zoltar thinks the Patriots are 14 points better than the Dolphins, and so does not recommend any action on this game.

==

Zoltar had a pretty good week last week. Against the Vegas spread, which is what Zoltar is designed to predict, Zoltar went a nice 4-2. (Advanced Zoltar went 3-1 in week #13).

For the 2017 season so far, against the Vegas point spread, Zoltar is a decent-but-not-great 40-24 (62% accuracy). If you must bet $110 to win $100 (typical in Vegas) then you must theoretically predict with 53% or better accuracy to make money (the breakeven point is 110/210, about 52.4%), but realistically you must predict at 60% or better accuracy.

Just for fun, I also track how well Zoltar does when only predicting which team will win. This isn’t really useful except for parlay betting. For week #13, Zoltar was a very good 13-3 just predicting which team would win.

For comparison purposes, I also track how well Bing Predicts and the Vegas line do when just predicting which team will win. In week #13, Bing was a good 12-4, and Vegas was a weak 11-5 when just predicting winners.

For the 2017 season so far, just predicting the winning team, Zoltar is 131-61 (68% accuracy), Bing is 127-65 (66% accuracy), and Vegas is 120-65 (65% accuracy). The best humans are typically about 67% accurate predicting winners, so currently, Zoltar and Bing are just about as good as the best humans when just predicting which team will win.


My system is named after the Zoltar arcade fortune teller machine

Posted in Machine Learning, Zoltar

The Akka.NET Framework

I was taking a deep look at the Akka.NET framework recently. See https://getakka.net/. The Akka.NET framework is an open source C# library intended to help developers create distributed, asynchronous systems. My exploration made me realize how brilliant the Microsoft .NET Framework is.

Let me explain. A distributed system is one where processing (and data) are on two or more machines. Writing distributed systems is extremely difficult.

An asynchronous system is one where entities communicate without blocking to wait for each other’s responses, so many interactions can be in flight at the same time. Writing asynchronous systems is typically incredibly difficult.

So, writing distributed asynchronous systems is basically nightmarishly difficult.

The key idea of Akka.NET is the notion of an Actor object. All key entities are Actors that communicate through asynchronous messaging.
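
For example, here is a minimal sketch of an Actor in C#, assuming the Akka.NET NuGet package is referenced (the GreetActor class and its string message are just my illustration, not code from the Akka.NET documentation):

using System;
using Akka.Actor;

// an Actor that reacts to string messages
class GreetActor : ReceiveActor
{
  public GreetActor()
  {
    Receive<string>(name => Console.WriteLine("Hello, " + name));
  }
}

class Program
{
  static void Main()
  {
    // the ActorSystem hosts and manages all the Actors
    ActorSystem system = ActorSystem.Create("demo");
    IActorRef greeter = system.ActorOf<GreetActor>("greeter");

    // Tell() queues the message and returns immediately --
    // the Actor processes it asynchronously
    greeter.Tell("Akka.NET");

    Console.ReadLine();  // keep the process alive while messages are handled
    system.Terminate().Wait();
  }
}

Notice that you never call a method on an Actor directly — all interaction goes through asynchronous messages.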

The hard part about learning Akka.NET is not so much the code, it’s more about thinking in a completely new programming paradigm.

The idea of programming with Actors has been around since the 1970s. Many programming languages have been influenced by the idea. The original Akka library is an open source project, written in Scala for the Java virtual machine, first released in 2009. The Akka.NET framework is a port of that original library.

One thing my exploration of Akka.NET made me realize is how brilliant the C# language (and, by extension, the .NET Framework) is — C# is at exactly the right level of abstraction to allow direct code implementation, or to facilitate extension via libraries. I enjoy writing all kinds of software systems, on all kinds of platforms, but Microsoft technologies are (with a few minor exceptions) vastly superior to non-Microsoft technologies (macOS, Linux, Java, etc.), at least for most of the work I do.




Actor James McCaffrey bears an astonishing resemblance to my younger brother Roger, leading me to believe my brother and I are related to the actor in some way.

Posted in Miscellaneous

Roulette Wheel Selection Algorithm

Roulette Wheel Selection is a common algorithm used to select an item proportional to its probability. Suppose you have four items, numbered (0, 1, 2, 3). Suppose you want the probabilities of each item being selected to be (0.2, 0.4, 0.3, 0.1) respectively.

So, you want item [0] to be selected with probability prob = 0.20, and item [1] to be selected with prob = 0.40, item [2] with prob = 0.30, and item [3] with prob = 0.10.

Conceptually you construct an array of cumulative probabilities:

(0.20, 0.60, 0.90, 1.00)

Then you generate a random value p between 0.0 and 1.0. If p is between (0.0, 0.20) you select [0]. If p is between (0.20, 0.60) you select [1]. If p is between (0.60, 0.90) you select [2]. If p is between (0.90, 1.00) you select [3].

The algorithm is subtler than it first appears. In practice you don’t have to compute the array of cumulative probabilities — you can just accumulate on the fly. And you have to be careful about boundary values; in the example above, if p is exactly 0.60, do you select [1] or [2]?

Anyway, while I was looking at the pursuit algorithm for the multi-armed bandit problem, I realized I’d be needing a RWS routine, so I coded one up. In my demo, I used probs = (0.2, 0.4, 0.3, 0.1) and generated 1,000 index values, so, I’d expect to get counts of about (200, 400, 300, 100) respectively.
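
Here is a minimal sketch of what such a routine can look like (this isn’t my exact demo code). It accumulates on the fly rather than building the cumulative array, and it uses a strict less-than comparison, so a p of exactly 0.60 would select item [2]:

using System;

class RouletteWheelDemo
{
  // select an index with probability proportional to probs[i]
  static int RouletteSelect(double[] probs, Random rnd)
  {
    double p = rnd.NextDouble();  // in [0.0, 1.0)
    double cum = 0.0;
    for (int i = 0; i < probs.Length; ++i)
    {
      cum += probs[i];  // accumulate on the fly
      if (p < cum) return i;
    }
    return probs.Length - 1;  // guard against floating point round-off
  }

  static void Main()
  {
    double[] probs = new double[] { 0.2, 0.4, 0.3, 0.1 };
    Random rnd = new Random(0);
    int[] counts = new int[4];
    for (int trial = 0; trial < 1000; ++trial)
      ++counts[RouletteSelect(probs, rnd)];
    // expect counts of roughly (200, 400, 300, 100)
    Console.WriteLine(string.Join(" ", counts));
  }
}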

Good fun!


Ursula Andress in Casino Royale (1967)

Posted in Machine Learning

The Epsilon-Greedy Algorithm

The epsilon-greedy algorithm (often written using the actual Greek letter epsilon) is very simple and occurs in several areas of machine learning. One common use of epsilon-greedy is in the so-called multi-armed bandit problem.

Suppose you are standing in front of k = 3 slot machines. Each machine pays out according to a different probability distribution, and these distributions are unknown to you. And suppose you can play a total of 100 times.

You have two goals. The first goal is to experiment with a few coins to try and determine which machine pays out the best. The second, related, goal is to get as much money as possible. The terms “explore” and “exploit” are used to indicate that you have to use some coins to explore to find the best machine, and you want to use as many coins as possible on the best machine to exploit your knowledge.

Epsilon-greedy is almost too simple. As you play the machines, you keep track of the average payout of each machine. Then, you select the machine with the highest current average payout with probability = (1 – epsilon) + (epsilon / k) where epsilon is a small value like 0.10. And you select machines that don’t have the highest current payout average with probability = epsilon / k.

It’s much easier to understand with a concrete example. Suppose, after your first 12 pulls, you played machine #1 four times and won $1 two times and $0 two times. The average payout for machine #1 is $2/4 = $0.50.

And suppose you’ve played machine #2 five times and won $1 three times and $0 two times. The average payout for machine #2 is $3/5 = $0.60.

And suppose you’ve played machine #3 three times and won $1 one time and $0 two times. The average payout for machine #3 is $1/3 = $0.33.

Now you have to select a machine to play on try number 13. You generate a random number p, between 0.0 and 1.0. Suppose you have set epsilon = 0.10. If p > 0.10 (which it will be 90% of the time), you select machine #2 because it has the current highest average payout. But if p < 0.10 (which it will be only 10% of the time), you select a random machine, so each machine has a 1/3 chance of being selected.

Notice that machine #2 might get picked anyway because you select randomly from all machines.

Over time, the best machine will be played more and more often because it will pay out more often. In short, epsilon-greedy means pick the current best option ("greedy") most of the time, but pick a random option with a small (epsilon) probability sometimes.
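
In code, the selection step might look like the following minimal C# sketch, where the hard-coded totals just mirror the example above:

using System;

class EpsilonGreedyDemo
{
  static void Main()
  {
    Random rnd = new Random(0);
    int k = 3;
    double epsilon = 0.10;

    // machine #1: $2 over 4 pulls, #2: $3 over 5 pulls, #3: $1 over 3 pulls
    double[] sums = new double[] { 2.0, 3.0, 1.0 };  // total payout per machine
    int[] counts = new int[] { 4, 5, 3 };            // pulls per machine

    // pick the machine for pull number 13
    int selected;
    if (rnd.NextDouble() < epsilon)
    {
      selected = rnd.Next(k);  // explore: random machine, each with prob 1/k
    }
    else
    {
      selected = 0;  // exploit: machine with the highest average payout
      for (int i = 1; i < k; ++i)
        if (sums[i] / counts[i] > sums[selected] / counts[selected])
          selected = i;
    }
    Console.WriteLine("Selected machine #" + (selected + 1));
  }
}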

There are many other algorithms for the multi-armed bandit problem. But epsilon-greedy is incredibly simple, and often works as well as, or even better than, more sophisticated algorithms such as UCB ("upper confidence bound") variations.


“Greed” (about 1500) – Hieronymus Bosch

Posted in Machine Learning

Supervised, Unsupervised, and Reinforcement Learning

The vocabulary associated with machine learning can be quite confusing. One way to categorize the hundreds of ML techniques is to think about supervised learning techniques, unsupervised learning techniques, reinforcement learning techniques, and “other” learning techniques.

Briefly, in supervised learning, you use labeled training data to create a prediction model. In unsupervised learning, you use unlabeled data to try and find structure in the data. In reinforcement learning, you use feedback and a reward function to learn a policy for decision making.

In supervised learning, you start with training data that has input values and known correct output values. Then you use the training data to create a trained model, which often takes the form of a complex prediction equation. For example, suppose you want to predict the political leaning (conservative, moderate, liberal) of a person based on their age, income, years of education, and number of children. Your raw data might look like:

45, 98000.00, 14, 3, conservative
22, 23000.00, 11, 0, liberal
. . .

Using this labeled data, you could create a neural network classifier to predict the political leaning of a new, previously unseen person.

In unsupervised learning, you start with data that has input values but no known, correct output values. You analyze the unlabeled data to try and find structure in it. For example, you might have data about college students: their high school GPA, their college GPA, their SAT score, and so on. You could find clusters that the data falls into, and then manually examine those clusters to see if any interesting patterns emerge. Maybe you’d see that students with medium high school GPAs and high SAT scores group together for some reason.

In reinforcement learning, you don’t start with data. Instead you have some sort of a reward function, and you repeatedly try experiments in order to learn the best action for any given state. For example, suppose you are trying to make an autonomous robot navigate through a maze. You run the robot through the maze, over and over. You give the system a reward depending on how close the robot gets to the end of the maze. After many trials you have a policy that instructs the robot what to do at any given location.

Now, I’ve taken a lot of liberties with these very informal descriptions, to try and keep things simple. And there are many exceptions and important details I’ve left out. But, in short, supervised learning uses labeled training data. Unsupervised learning uses unlabeled raw data. Reinforcement learning uses a reward function instead of data.


Odalisque Sur Fond Rouge (1929) – Henri Matisse (a hand-colored pochoir created under Matisse’s supervision)

Posted in Machine Learning

NFL 2017 Week 13 Predictions – Zoltar Has Six Suggestions

Zoltar is my NFL football machine learning prediction system. It’s a hybrid system that uses a custom reinforcement learning algorithm plus a neural network. Here are Zoltar’s predictions for week #13 of the 2017 NFL season:

Zoltar:     cowboys  by    6  dog =    redskins    Vegas:     cowboys  by  1.5
Zoltar:     packers  by    6  dog =  buccaneers    Vegas:     packers  by    2
Zoltar:    dolphins  by    6  dog =     broncos    Vegas:     broncos  by    1
Zoltar:    patriots  by    5  dog =       bills    Vegas:    patriots  by    8
Zoltar:     falcons  by    2  dog =     vikings    Vegas:     falcons  by    3
Zoltar:       bears  by    6  dog = fortyniners    Vegas:       bears  by  3.5
Zoltar:     jaguars  by    6  dog =       colts    Vegas:     jaguars  by  9.5
Zoltar:      ravens  by    2  dog =       lions    Vegas:      ravens  by    3
Zoltar:      chiefs  by    4  dog =        jets    Vegas:      chiefs  by  3.5
Zoltar:      titans  by    6  dog =      texans    Vegas:      titans  by  6.5
Zoltar:    chargers  by   13  dog =      browns    Vegas:    chargers  by   13
Zoltar:     raiders  by    7  dog =      giants    Vegas:     raiders  by  7.5
Zoltar:      saints  by    6  dog =    panthers    Vegas:      saints  by    4
Zoltar:        rams  by    0  dog =   cardinals    Vegas:        rams  by    7
Zoltar:      eagles  by    0  dog =    seahawks    Vegas:      eagles  by    4
Zoltar:    steelers  by    5  dog =     bengals    Vegas:    steelers  by    6

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. In previous years, Zoltar typically would recommend two or three games, but this season Zoltar and the Las Vegas line have differed greatly, and so Zoltar has been recommending six to eight games per week — the large number of injuries to key players is the primary factor for the difference this season. This week, Zoltar has six hypothetical suggestions.

1. Zoltar likes the Vegas favorite Cowboys over the Redskins. Vegas has the Cowboys as a slim 1.5-point favorite, but Zoltar thinks the Cowboys are 6 points better than the Redskins. So, a bet on the Cowboys will pay off only if the Cowboys win by more than 1.5 points (in other words, by 2 points or more).

2. Zoltar likes the Vegas favorite Packers against the Buccaneers. Vegas thinks the Packers are just 2.0 points better than the Buccaneers, but Zoltar thinks the Packers are 6 points better.

3. Zoltar likes the Vegas underdog Dolphins against the Broncos. Vegas has the Broncos as 1.0-point favorites, but Zoltar thinks the Dolphins are 6 points better than the Broncos. A bet on the Dolphins will pay off if the Dolphins win by any number of points, or if the Broncos win by less than 1 point — which is impossible — so in effect a bet on the Dolphins is just a bet that the Dolphins will win the game (if the Broncos win by exactly 1 point, the game is a push).

4. Zoltar likes the Vegas underdog Colts against the Jaguars.

5. Zoltar likes the Vegas underdog Cardinals against the Rams.

6. Zoltar likes the Vegas underdog Seahawks against the Eagles.

This NFL season has been extraordinarily more volatile than any season in recent memory. There are key injuries, literally every week, and the Vegas point spreads change rapidly and in large amounts.

==

Zoltar had a so-so week last week. Against the Vegas spread, which is what Zoltar is designed to predict, Zoltar went 4-3. (I have an advanced version of Zoltar that takes injuries into account — advanced Zoltar went 1-0 in week #12).

For the 2017 season so far, against the Vegas point spread, Zoltar is a mediocre 36-22 (62% accuracy). If you must bet $110 to win $100 (typical in Vegas) then you must theoretically predict with 53% or better accuracy to make money, but realistically you must predict at 60% or better accuracy.

Just for fun, I also track how well Zoltar does when only predicting which team will win. This isn’t really useful except for parlay betting. For week #12, Zoltar was an excellent 14-2 just predicting which team would win.

For comparison purposes, I also track how well Bing and the Vegas line do when just predicting who will win. In week #12, Bing was a good 12-4, and Vegas was also good at 13-3 when just predicting winners.

For the 2017 season so far, just predicting the winning team, Zoltar is 118-58 (67% accuracy), Bing is 115-61 (65% accuracy), and Vegas is slightly behind at 109-60 (64% accuracy). The best humans are typically about 67% accurate predicting winners, so currently, both Zoltar and Bing are just about as good as the best humans when just predicting which team will win.


My Zoltar is named after the arcade fortune teller machine, which in turn is named after a machine from the 1988 movie “Big”

Posted in Machine Learning, Zoltar