The Naive Bayes technique can be used for binary classification, for example predicting if a person is male or female based on predictors such as age, height, weight, and so on, or for multiclass classification, for example predicting if a person is politically conservative, moderate, or liberal based on predictors such as annual income, sex, and so on. Naive Bayes classification can be used with numeric predictor values, such as a height of 5.75 feet, or with categorical predictor values such as a color of “red”.

In the article I explain how to create a naive Bayes classification system when the predictor values are numeric, using the C# language without any special code libraries. In particular, the goal of the demo program was to predict the gender of a person (male = 0, female = 1) based on their height, weight, and foot length. After creating a prediction model, the demo set up a new data item to classify, with predictor values of height = 5.60, weight = 150, foot = 8.

The probability that the unknown person is male was 0.62 and the probability of female was 0.38, so the conclusion was that the unknown person is most likely male.

The naive Bayes classification technique has “naive” in its name because it assumes that each predictor value is mathematically independent. Naive Bayes classification with numeric data makes the additional assumption that all predictor variables are Gaussian distributed. This assumption is sometimes not true. For example, the ages of people in a particular profession could be significantly skewed or even bimodal. In spite of these assumptions, naive Bayes classification often works quite well.

*From an Internet search for naive characters in film. Princess Giselle in “Enchanted” (2007), Lorelei in “Gentlemen Prefer Blondes” (1953), Jade in “The Hangover” (2009), Cher in “Clueless” (1995).*

```
Zoltar: steelers by 0 dog = browns Vegas: browns by 2.5
Zoltar: panthers by 6 dog = falcons Vegas: panthers by 6
Zoltar: colts by 6 dog = jaguars Vegas: colts by 3
Zoltar: cowboys by 0 dog = lions Vegas: cowboys by 4
Zoltar: bills by 0 dog = dolphins Vegas: bills by 5.5
Zoltar: vikings by 9 dog = broncos Vegas: vikings by 10
Zoltar: ravens by 4 dog = texans Vegas: ravens by 4
Zoltar: saints by 7 dog = buccaneers Vegas: saints by 5
Zoltar: redskins by 5 dog = jets Vegas: redskins by 1.5
Zoltar: fortyniners by 10 dog = cardinals Vegas: fortyniners by 13.5
Zoltar: patriots by 0 dog = eagles Vegas: patriots by 3.5
Zoltar: raiders by 9 dog = bengals Vegas: raiders by 10.5
Zoltar: rams by 6 dog = bears Vegas: rams by 6.5
Zoltar: chiefs by 3 dog = chargers Vegas: chiefs by 3.5
```

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. For week #11 Zoltar has five hypothetical suggestions.

1. Zoltar likes the Vegas underdog Lions against the Cowboys. Zoltar thinks the two teams are evenly matched but Vegas has the Cowboys favored by 4.0 points. A bet on the Lions will pay off if the Lions win by any score or if the Cowboys win but by less than 4 points (i.e., 3 points or less). If the Cowboys win by exactly 4 points the bet is a push.

2. Zoltar likes the Vegas underdog Dolphins against the Bills. Zoltar thinks the two teams are evenly matched but Vegas has the Bills favored by 5.5 points.

3. Zoltar likes the Vegas underdog Cardinals against the 49ers. Zoltar thinks the 49ers are a big 10 points better than the Cardinals but Vegas thinks the 49ers are a very big 13.5 points better.

4. Zoltar likes the Vegas underdog Eagles against the Patriots. Zoltar thinks the two teams are evenly matched but Vegas has the Patriots favored by 3.5 points.

5. Zoltar likes the Vegas favorite Redskins against the Jets. Zoltar thinks the Redskins are 5 points better than the Jets but Vegas thinks the Redskins are only 1.5 points better. A bet on the Redskins will pay off only if the Redskins win by more than 1.5 points.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

Zoltar did OK in week #10. Against the Vegas point spread, Zoltar was a reasonable 2-1. Zoltar correctly liked Vegas underdogs Steelers and Seahawks, both of whom won outright. Zoltar missed by predicting the Saints would cover the spread but the Saints lost badly to the Falcons.

For the 2019 season, through week #10, Zoltar is 33-20 (62% accuracy) against the Vegas spread.

Just for fun, I track how well Zoltar and Las Vegas do when just trying to predict only which team will win (but not by how much). This isn’t useful except for parlay betting.

Just predicting winners, Zoltar was a weak 8-5. But Vegas was even worse at 5-8 just predicting winners.

For the season Zoltar is a pretty decent 97-50 (66%) just picking winners and Vegas is at 92-52 (64%).

Note: Vegas has had three pick’em games so far and there has been one tie game.

*My system is named after the Zoltar fortune telling machine you can find in arcades. Here are Zoltar and three nice art nouveau style paintings of gypsy fortune tellers.*

With a List data structure it’s easy to know exactly where any child or parent node is. Suppose you have a tree with seven nodes:

```
        0
      /   \
     1     2
    / \   / \
   3   4 5   6
```

If each node has an ID i, where root = 0, left child of root = 1, right child of root = 2 and so on then:

The left child of i is located at index [2i + 1].

The right child of i is located at index [2i + 2].

If i is an odd number (i % 2 != 0), the node is a left child.

If i is even (and i != 0), the node is a right child.

A left child’s parent is at index [(i - 1) / 2] (integer division).

A right child’s parent is at index [(i - 2) / 2] (integer division).

Simple, easy, and efficient.
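These formulas can be sketched in a few lines of Python (the article’s demos use C#; this is just a minimal illustration of the same index arithmetic):

```python
# Index arithmetic for a complete binary tree stored in a list.
# Node IDs are assigned level by level: root = 0, its children are 1 and 2, etc.

def left_child(i):
    return 2 * i + 1

def right_child(i):
    return 2 * i + 2

def parent(i):
    # odd i is a left child, even i (other than the root) is a right child
    return (i - 1) // 2 if i % 2 != 0 else (i - 2) // 2

tree = [0, 1, 2, 3, 4, 5, 6]  # the seven-node example tree
print(left_child(0), right_child(0))   # children of the root: 1 2
print(parent(3), parent(4))            # both are children of node 1: 1 1
```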

To traverse a tree implemented as a List, you just walk through the List in order:

```
for i = 0 to numNodes-1
  display(tree[i])
end-for
```

This will display the tree level by level: (0, 1, 2, 3, 4, 5, 6). Traversing level by level is perfectly fine for most problem scenarios. But suppose you want to traverse/display the tree in what’s called an inorder manner. This is a common ordering because it’s easy to do for a tree implemented with node objects and recursion:

```
display(root)
  if root != null
    display(root->left)
    print(root)
    display(root->right)
  end-if
end-display
```

For the seven-node tree above, the nodes would be printed as 3 1 4 0 5 2 6. To print a tree implemented as a List, you need to use a Stack and do a little work. I hadn’t looked at this problem in a long time so I decided to code up a demo to see if I remembered the algorithm. I did.

*Displaying a tree implemented using a List in an inorder manner.*
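Here is a minimal Python sketch of the stack-based inorder traversal of a tree stored in a list (my demo was C#; this uses the child-index formulas described earlier):

```python
def inorder(tree):
    # Inorder traversal of a complete binary tree stored in a list,
    # using an explicit stack instead of recursion.
    result = []
    stack = []
    i = 0              # start at the root
    n = len(tree)
    while stack or i < n:
        while i < n:   # descend as far left as possible
            stack.append(i)
            i = 2 * i + 1
        i = stack.pop()        # visit the node
        result.append(tree[i])
        i = 2 * i + 2          # then traverse its right subtree
    return result

print(inorder([0, 1, 2, 3, 4, 5, 6]))  # [3, 1, 4, 0, 5, 2, 6]
```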

When I was a college professor I used to enjoy teaching students how to implement a tree data structure using recursion because the technique is fascinating. But I always told my students that knowing how to use recursion is fine, but in a production environment you should avoid recursion if possible — as a rule of thumb, recursive functions are tricky, error-prone, and difficult to maintain or modify.

*Left: Some trees in November on the street where I live — naturally beautiful. Center: An old-style paint-by-numbers painting of trees — oddly attractive. Right: A huge alien tree on an alien world (unknown artist) — very creative.*

Naive Bayes classification can be used for numeric data, such as predicting the sex of a person who has height = 6.00′, weight = 185 lbs, foot = 9 inches. Naive Bayes can also be used for categorical data, such as predicting the sex of a person who has height = tall, weight = medium, foot = normal. The underlying theory is the same for the numeric data and categorical data scenarios, but the details are quite a bit different.

*My demo program uses the data from the Wikipedia page on naive Bayes. There are 8 items. Each item is the height, weight, and foot size of a male or female. The goal is to predict the sex of a person who is 6.00 feet tall, weighs 130 lbs and has foot size 8 inches. The result is P(female) = 0.9999884.*

A few days ago I reviewed an example with numeric data on the Wikipedia page on naive Bayes. I verified the Wikipedia calculations by performing the calculations myself, using Excel.

Just for fun I decided to perform the calculations using a C# program. It was an interesting exercise. I didn’t have any major problems because I’m quite familiar with naive Bayes for numeric data. The technique assumes that all data is Gaussian distributed and the technique uses the Gaussian probability distribution function, which I’m also very familiar with.

*Here is the same problem, solved using Excel.*
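The calculation is easy to sketch in Python as well (my demo was C#; the training data and the 0.9999884 result below are from the Wikipedia example described above):

```python
import math

def gaussian_pdf(x, mean, var):
    # Gaussian probability density function
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Wikipedia naive Bayes data: (height ft, weight lbs, foot size in)
males   = [(6.00, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)]
females = [(5.00, 100,  6), (5.50, 150,  8), (5.42, 130,  7), (5.75, 150,  9)]

def stats(rows):
    # per-column sample mean and sample variance (n-1 denominator)
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    vars_ = [sum((v - m) ** 2 for v in col) / (n - 1)
             for col, m in zip(zip(*rows), means)]
    return means, vars_

def evidence(rows, item):
    # class prior times the product of the per-predictor Gaussian densities
    means, vars_ = stats(rows)
    p = 0.5  # equal class priors
    for x, m, v in zip(item, means, vars_):
        p *= gaussian_pdf(x, m, v)
    return p

item = (6.00, 130, 8)
em, ef = evidence(males, item), evidence(females, item)
p_female = ef / (em + ef)
print(p_female)  # approximately 0.9999884
```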

While I was reviewing the details of how naive Bayes classification works, I came across a technique called Bayes point machine classification. I spent a couple of hours trying to make sense of the little information I found on the Internet, including the source research paper. As far as I can tell, the Bayes point machine is yet another example of prolific research efforts that are overly complex solutions in search of a problem.

The fact that almost nobody uses the Bayes point machine classification technique suggests that it has no advantages over techniques, such as a shallow neural network, that are much simpler. I could be wrong however. The source research paper is very poorly written, in the sense that the paper was not written so that someone could actually implement the technique. So I’ll need to probe a bit deeper before I’m satisfied that Bayes point machine classification is in fact a dead end.

*Robert K. Abbett (1926-2015) was a prolific artist who did the covers of many paperback novels in the 1960s. I like his style of art a lot. I’ve read “Thuvia, Maid of Mars”, by Edgar Rice Burroughs — an excellent novel. I haven’t read the other two books, but I suspect the cover art for them is better than the content. “When she crashed into his house, about all she wore was a guilty look.” Brilliant — modern day Shakespeare.*

```
1.0, 2.0, 3.0, 0
4.0, 5.0, 6.0, 0
7.0, 8.0, 9.0, 1
10.0, 11.0, 12.0, 1
13.0, 14.0, 15.0, 1
16.0, 17.0, 18.0, 2
19.0, 20.0, 21.0, 3
22.0, 23.0, 24.0, 3
```

Each row represents a person. There are 4 classes of people indicated by the last value in each row. There are three predictor variables. The data is artificial but you can imagine the four classes represent job type (engineering, sales, management, operations) and the three predictor variables are sick-days, personal-days-off, and vacation-days. The goal is to predict job type from the predictor values.

There are many machine learning techniques you can use to create a prediction model, including numeric naive Bayes, k-NN, neural network classifiers, etc. One of the most basic techniques is to use a decision tree. The final form of a decision tree will be a set of rules like, “if sick < 15.0 and vacation ≥ 12.0 then job-type = 2”.

Creating a decision tree classifier is not too difficult conceptually, but the implementation details are very tricky. (This is the opposite of neural networks, which are conceptually quite deep but whose implementation is not very difficult.)

One of the key ideas when creating a decision tree is repeatedly splitting data into two groups so that the two groups have mostly the same class.

The two most common approaches when splitting data for a decision tree are using Shannon entropy and using Gini impurity. Both are measures of disorder in a set of items. I usually prefer to use Gini impurity but sometimes entropy works slightly better. Suppose you have four classes, 0 to 3, and a set of eight items: (0, 0, 1, 1, 1, 1, 1, 3). The Shannon entropy of a set of items is defined as -1 * Sum[p * log2(p)] where p is the probability of each class. So P(0) = 2/8 = 0.25, P(1) = 5/8 = 0.625, P(2) = 0/8 = 0.00, P(3) = 1/8 = 0.125. The sum of the product of each probability times the log to the base 2 of the probability is:

sum = 0.25 * log2(0.25) + 0.625 * log2(0.625) + 0.00 * log2(0.00) + 0.125 * log2(0.125)
    = 0.25 * -2.0000 + 0.625 * -0.6781 + 0.00 * (na) + 0.125 * -3.0000
    = -0.5000 + -0.4238 + 0.00 + -0.3750
    = -1.2988

entropy = -1 * sum = 1.2988

You have to avoid trying to compute log2(0) because log to base anything of zero is negative infinity.

Lower values of entropy mean the data items in a set are mostly the same. Higher values of entropy indicate more disorder (the data items aren’t the same). In the extreme, the entropy for a set of items that are all identical is 0.00 — for decision trees lower entropy is better. The largest possible value of Shannon entropy is log2 of the number of classes. For example, if you had 10 items and they were all different, Shannon entropy is log2(10) = 3.3219.

When creating a decision tree you want to split a set of items into two subsets so that the entropy of the class values is low. You could just try different splits, compute the entropies of the items in each of the two partitions, then take the average. This is OK but has a minor downside: partitions of different sizes are weighted the same. Therefore, it’s usual to weight the two entropy values by the number of items in each partition.

For the dummy data above, suppose you decide to split the eight items into the first three items (0, 0, 1) and the last five items (1, 1, 2, 3, 3). The entropy of the first set is 0.9183. The entropy of the second set is 1.5219. The weighted average is (3/8) * 0.9183 + (5/8) * 1.5219 = 1.2956.
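The entropy and weighted-average calculations can be sketched in a few lines of Python (matching the split just described):

```python
import math
from collections import Counter

def entropy(items):
    # Shannon entropy of the class labels in items
    n = len(items)
    ent = 0.0
    for count in Counter(items).values():
        p = count / n
        ent -= p * math.log2(p)  # zero-count classes are simply absent
    return ent

left, right = [0, 0, 1], [1, 1, 2, 3, 3]
n = len(left) + len(right)
weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
print(round(entropy(left), 4), round(entropy(right), 4), round(weighted, 4))
# 0.9183 1.5219 1.2956
```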

Understanding entropy and disorder is the first step in gaining the knowledge you need to implement a decision tree classifier. Next, you need to understand how to search through your data to find a good split. You can’t try all possible splits because of the combinatorial explosion problem, so you have to use a different approach. I’ll explain in a future post.

*I wonder if people tend to favor one hand over the other when making the OK sign. I always use my left hand for an OK.*

One of the topics in my old article was Braess’s Paradox. Briefly, if you have a road network, adding a new road can actually make travel times worse. The same principle applies to computer networks.

There are several common examples used to illustrate Braess’s Paradox. The image below is one:

Cars must travel from A to D. The travel time on road A-B is 10x minutes, where x is the number of cars on that road, and similarly for road C-D. The travel time on roads B-D and A-C is x + 50 minutes. Suppose there are N = 6 cars. Before the new road addition, 3 cars will take the A-B-D route and 3 cars will take the A-C-D route. The time for a car on the upper A-B-D route will be (10 * 3) + (3 + 50) = 83 minutes. The time for a car on the lower A-C-D route will be (3 + 50) + (10 * 3) = 83 minutes.

Notice that no car will switch routes because it will take longer. For example, suppose one of the cars that takes the upper route decides to take the lower route instead. His travel time will be (4 + 50) + (10 * 4) = 94 minutes. When a system like this is stable, it’s said to be in Nash equilibrium.

Now suppose a new road between B and C is added, with a travel time of x + 10 minutes. Weirdly, equilibrium will be reached when 2 cars use route A-B-D, 2 cars use A-B-C-D, and 2 cars use A-C-D.

The travel time for A-B-D is (10 * 4) + (2 + 50) = 92 minutes.

The travel time for A-B-C-D is (10 * 4) + (2 + 10) + (10 * 4) = 92 minutes.

The travel time for A-C-D is (2 + 50) + (10 * 4) = 92 minutes.

And, although it’s not obvious, if any driver changes routes, he will take longer than 92 minutes. So, the effect of adding a new road is to increase the travel time for every car from 83 minutes to 92 minutes. Note that the situation could be avoided if all the drivers cooperated.
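The arithmetic can be verified with a short Python sketch. The link cost functions are inferred from the calculations above: 10x minutes on links A-B and C-D (where x is the number of cars on the link), x + 50 minutes on links B-D and A-C, and x + 10 minutes on the new B-C link.

```python
def abd_time(n_ab, n_bd):
    # upper route: A-B (10x) then B-D (x + 50)
    return 10 * n_ab + (n_bd + 50)

def acd_time(n_ac, n_cd):
    # lower route: A-C (x + 50) then C-D (10x)
    return (n_ac + 50) + 10 * n_cd

def abcd_time(n_ab, n_bc, n_cd):
    # new route: A-B (10x), B-C (x + 10), C-D (10x)
    return 10 * n_ab + (n_bc + 10) + 10 * n_cd

# Before the new road: 3 cars on A-B-D, 3 cars on A-C-D
print(abd_time(3, 3), acd_time(3, 3))  # 83 83

# A car that switches from the upper route to the lower route does worse
print(acd_time(4, 4))  # 94

# After the B-C road: 2 cars each on A-B-D, A-B-C-D, A-C-D,
# so A-B carries 4 cars, B-C carries 2, C-D carries 4.
print(abd_time(4, 2), abcd_time(4, 2, 4), acd_time(2, 4))  # 92 92 92
```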

Very cool. Very strange!

*Three colorful sea slugs. Kind of cool. Very strange. But kind of creepy and scary. Left: Nembrotha kubaryana (“neon slug”). Center: Flabellinopsis iodine (“Spanish shawl”). Right: Hypselodoris apolegma (“purple sea slug”).*

```
Zoltar: chargers by 0 dog = raiders Vegas: chargers by 1
Zoltar: bears by 6 dog = lions Vegas: bears by 3
Zoltar: ravens by 10 dog = bengals Vegas: ravens by 10
Zoltar: bills by 0 dog = browns Vegas: browns by 3
Zoltar: packers by 6 dog = panthers Vegas: packers by 5
Zoltar: saints by 18 dog = falcons Vegas: saints by 12.5
Zoltar: jets by 1 dog = giants Vegas: giants by 2
Zoltar: chiefs by 1 dog = titans Vegas: chiefs by 4
Zoltar: buccaneers by 4 dog = cardinals Vegas: buccaneers by 4.5
Zoltar: colts by 10 dog = dolphins Vegas: colts by 10.5
Zoltar: rams by 0 dog = steelers Vegas: rams by 4
Zoltar: cowboys by 4 dog = vikings Vegas: cowboys by 3
Zoltar: seahawks by 0 dog = fortyniners Vegas: fortyniners by 6
```

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. For week #10 Zoltar has three hypothetical suggestions.

1. Zoltar likes the Vegas favorite Saints against the Falcons. Zoltar thinks the Saints are a massive 18 points better than the Falcons but Vegas thinks the Saints are only 12.5 points better. A bet on the Saints will pay off only if the Saints win by more than 12.5 points, in other words 13 points or more.

2. Zoltar likes the Vegas underdog Steelers against the Rams. Zoltar thinks the two teams are evenly matched (taking home field advantage into account) but Vegas believes the Rams are 4.0 points better than the Steelers. A bet on the Steelers will pay off if the Steelers win by any score or if the Rams win but by less than 4.0 points (if the Rams win by exactly 4 points the bet is a push).

3. Zoltar likes the Vegas underdog Seahawks against the 49ers. Zoltar inexplicably thinks the two teams are evenly matched even though the 49ers are undefeated and the game is being played in San Francisco. I might have a bug in the system – I need to double check this.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

Zoltar did very well in week #9. Against the Vegas point spread, Zoltar was a good 5-2. Zoltar correctly liked Vegas underdogs Cardinals, Dolphins, Ravens and Vegas favorites Texans, Seahawks. Zoltar missed with recommendations on underdogs Redskins and Bears. (Zoltar got incredibly lucky on the Seahawks game — a late point spread move, plus a missed short field goal, plus an overtime touchdown.)

For the 2019 season, through week #9, Zoltar is 31-19 (62% accuracy) against the Vegas spread.

Just for fun, I track how well Zoltar and Las Vegas do when just trying to predict only which team will win (but not by how much). This isn’t useful except for parlay betting.

Just predicting winners, Zoltar was an excellent 14-0. Vegas was a so-so 9-5 just predicting winners. For the season Zoltar is a pretty decent 89-45 (66%) just picking winners and Vegas is almost identical at 87-44 (66%). Note: Vegas has had three pick’em games so far and there has been one tie game. Just picking winners, Vegas is significantly more accurate this year than in any of the previous 20 years.

*My system is named after the Zoltar fortune teller machine you can find in arcades. That machine is named after the machine from the 1988 movie “Big” starring Tom Hanks. And the movie Zoltar is named after the Zoltan arcade machine from the 1960s.*

Data clustering is the process of grouping data items so that similar items are in the same group/cluster and dissimilar items are in different clusters. The most commonly used clustering algorithm is called k-means. The k-means approach is simple and effective, but it doesn’t always work well with a dataset that has skewed distributions.

In my article I present a demo program of a clustering technique called the Gaussian mixture model (GMM). In my demo, I set up eight dummy items to cluster:

```
{ 0.2, 0.7 }
{ 0.1, 0.9 }
{ 0.2, 0.8 }
{ 0.4, 0.5 }
{ 0.5, 0.4 }
{ 0.9, 0.3 }
{ 0.8, 0.2 }
{ 0.7, 0.1 }
```

Each of the dummy data items represents the height and width of a package of some sort. The dimensionality of the problem is 2. The name of the technique comes from the underlying math assumption: each of the variables (height and width in this example) is distributed according to a Gaussian (“normal” or “bell-shaped”) distribution.

*The underlying math for GMM clustering isn’t trivial, which is one of the reasons GMM clustering isn’t used very often.*

In the demo, I specify the number of clusters to place the data items into as K = 3. After applying the Gaussian mixture model clustering technique, the result is:

```
w (membership wts):
0.9207  0.0793  0.0000
0.9737  0.0263  0.0000
0.9587  0.0413  0.0000
0.0015  0.9962  0.0023
0.0000  0.9430  0.0570
0.0000  0.3848  0.6152
0.0000  0.1806  0.8194
0.0000  0.2750  0.7250
```

The first row of the results corresponds to the first data item. It means the probability that the data item belongs to cluster k = 0 is 0.9207, the probability of belonging to cluster k = 1 is 0.0793, and the probability of belonging to cluster k = 2 is 0.0000. Therefore, the first item belongs to cluster k = 0.

GMM clustering isn’t used very often. Implementing GMM is much, much more difficult than implementing the basic k-means clustering algorithm so the cost of using GMM rather than k-means usually outweighs the benefits. But GMM is a fascinating technique, and one that every data scientist should be aware of.
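A minimal Python sketch of GMM clustering via expectation-maximization, using the eight demo items. This is a simplified version with spherical covariances and a fixed initialization (three spread-out items as initial means), not the full demo, so the resulting membership weights differ in detail from those shown above:

```python
import math

data = [(0.2, 0.7), (0.1, 0.9), (0.2, 0.8), (0.4, 0.5),
        (0.5, 0.4), (0.9, 0.3), (0.8, 0.2), (0.7, 0.1)]
K = 3

means = [list(data[0]), list(data[3]), list(data[7])]  # deterministic init
variances = [0.05] * K       # one spherical variance per cluster
coefs = [1.0 / K] * K        # mixture coefficients

def density(x, mean, var):
    # spherical 2-D Gaussian density
    d2 = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return math.exp(-d2 / (2 * var)) / (2 * math.pi * var)

for _ in range(50):
    # E-step: membership weight of each item in each cluster
    w = []
    for x in data:
        probs = [coefs[k] * density(x, means[k], variances[k]) for k in range(K)]
        total = sum(probs)
        w.append([p / total for p in probs])
    # M-step: re-estimate coefficients, means, variances from the weights
    for k in range(K):
        nk = sum(w[i][k] for i in range(len(data)))
        coefs[k] = nk / len(data)
        means[k] = [sum(w[i][k] * data[i][d] for i in range(len(data))) / nk
                    for d in range(2)]
        variances[k] = max(1.0e-4, sum(
            w[i][k] * ((data[i][0] - means[k][0]) ** 2 +
                       (data[i][1] - means[k][1]) ** 2)
            for i in range(len(data))) / (2 * nk))

for row in w:
    print(["%0.4f" % v for v in row])
```

Each row of w sums to 1.0, and each item is assigned to the cluster with the largest membership weight.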

*Three visually interesting (to me anyway) images from an Internet search for “mixed model art”.*

I was reminded of the topic when I saw a news article titled, “Michelle Obama Discusses White Flight” and noticed there were *thousands* of reader comments. I decided to do an informal manual scan of the comments to see if there were any patterns related to the comments that received over 1,000 up-votes.

*Machine learning is very good at sentiment analysis problems. Here’s a demo I created using the PyTorch neural library to analyze the sentiment of movies reviews.*

To set context, the former First Lady was at a conference, promoting a book she wrote. Some of her statements included the following.

“When we moved in, white families moved out … I want to remind white folks that y’all were running from us.”

“We’re no different than the immigrant families that are moving in.”

“I don’t know what’s going on, I can’t explain what’s happening in your head.”

“If you’re biased you’re broken and you can’t fix that brokeness.”

Here are some of the most agreed upon comments from readers.

“It’s not just ‘white flight’, sweetheart — you and your family left Chicago too.”

“A few years back I went to Detroit for an auto show. It was like entering a war zone. I don’t want this kind of thing coming to my neighborhood.”

“You say , ‘We’re no different’. There are a few differences Michelle. Other groups work hard, stay in school, respect the law, and don’t breed fatherlesss welfare children.”

“It’s not fear … it’s statistics. 51% of violent crime from 13% of the population.”

“Glad you’re gone. It’s so nice now to have a first lady in the White House with charm and grace instead of a media whore.”

I couldn’t determine why this particular news story generated so many comments. And there didn’t seem to be any pattern to the high-like comments. At first I thought it might have something to do with the fact that this was a combination of politics and race but I found many other stories that had the same general topic and those stories didn’t receive much notice.

The only thing that stands out is that the former First Lady’s statement completely grouped people together by race and made no individual distinctions. For example, “white folks” includes all white people, “We’re no different” includes all Black people, and so on. In other words her statements imply that all white people treat all Black people uniformly, and all Black people act uniformly. Perhaps readers don’t like statements that assume all people in a group think and act the same — essentially a definition of bias.

One thing I learned from statistics is that people are individuals and shouldn’t be grouped by skin color or age or spiritual belief or shoe size or anything else.

Another hypothesis is that readers don’t like being lectured by a person who achieved prominence by doing nothing (such as being a wife of a prominent person) rather than achieving prominence by doing something meaningful.

I’d say most of the readers’ comments were very harsh — even after likely filtering by the source agency. I don’t fully understand why people get so agitated over inconsequential statements by irrelevant people.

Machine learning has been very successful in many areas related to natural language processing, but I think scenarios like this have more to do with cognition, an area where ML still has a long way to go.

*Former First Lady Michelle Obama generates strong negative Internet news reader comments. Current First Lady Melania Trump tends to generate neutral or slightly positive comments. Machine learning and AI have not progressed to the point of being able to predict or explain such things.*

Suppose you have some data that represents people’s age, income, debt, and a class (0, 1, 2) that indicates how likely they are to buy something from your company. Let’s say you have 100 such data points. You want to use age, income, debt to predict the class. First you set a value for k. Suppose you set k = 4. Next you specify a source item to predict, say (0.39, 0.534, 0.152). Then you compute the mathematical distance from the source item to each of your 100 data points, and find the k = 4 nearest such points, and order by closest distance. Suppose the four closest points are:

```
idx   age   income  debt   distance  class
==========================================
[26]  0.39  0.539   0.151  0.0051    1
[13]  0.40  0.531   0.157  0.0116    0
[99]  0.38  0.561   0.149  0.0289    2
[57]  0.36  0.540   0.149  0.0307    1
```

With this information, using k = 4, it’s clear that the predicted class should be c = 1 because two of the four closest points, including the closest point, have class 1 (a majority-rule voting scheme). There are several other voting schemes you can use.

In short, the main idea behind k-NN is to find the closest neighbors to the item you want to classify then see what those closest neighbors are like.
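The idea can be sketched in a few lines of Python. The training data here is hypothetical except for the four closest points from the table above; the demo itself was C# and used an inverse-weights voting scheme, while this sketch uses simple majority voting:

```python
import math
from collections import Counter

def knn_predict(train, item, k):
    # train: list of (features, class_label); item: feature tuple to classify
    dists = []
    for feats, label in train:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(feats, item)))
        dists.append((d, label))
    dists.sort(key=lambda t: t[0])                # order by closest distance
    nearest = [label for _, label in dists[:k]]   # labels of k nearest points
    return Counter(nearest).most_common(1)[0][0]  # majority-rule vote

# hypothetical (age, income, debt) data, normalized to [0, 1]
train = [((0.39, 0.539, 0.151), 1), ((0.40, 0.531, 0.157), 0),
         ((0.38, 0.561, 0.149), 2), ((0.36, 0.540, 0.149), 1),
         ((0.70, 0.300, 0.400), 0), ((0.10, 0.800, 0.050), 2)]

print(knn_predict(train, (0.39, 0.534, 0.152), k=4))  # 1
```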

Somewhat surprisingly, implementing k-NN classification using a programming language like C# or Python was much trickier than I expected. There wasn’t one major hurdle; there were lots of small tricky details.

*A demo of k-NN classification using the C# language.*

The image above shows an example of k-NN classification using raw C# (without any code library). It uses a voting technique called inverse weights.

I’m going to deliver a one-day, hands-on workshop at the upcoming Microsoft Azure + AI Conference. The event runs Nov. 17-22, 2019 in Las Vegas. See https://www.azureaiconf.com. My all-day workshop is titled “Practical Machine Learning Using C#” and is on Friday, Nov. 22, 2019. One of the six techniques the workshop will cover is k-NN classification. If you attend the conference, be sure to track me down and maybe we can be nearest neighbors at a bar after the workshop.

*Three more or less random images from an Internet image search for paintings of nearest neighbors at a bar.*