## Naive Bayes Classification Example Using the scikit Library

Naive Bayes classification is a classical machine learning technique. It is best used when the predictor variables are all non-numeric. Naive Bayes works for both binary classification and multi-class classification. And naive Bayes works well when you don’t have very much training data.

I coded up a demo of naive Bayes using the scikit-learn library. The ideas are best explained by an example.

Consider the following data:

```
actuary   green   korea   F
barista   green   italy   M
dentist   hazel   japan   M
dentist   green   italy   F
chemist   hazel   japan   M
actuary   green   japan   F
actuary   hazel   japan   M
chemist   green   italy   F
chemist   green   italy   F
dentist   green   japan   F
barista   hazel   japan   M
dentist   green   japan   F
dentist   green   japan   F
chemist   green   italy   F
dentist   green   japan   M
dentist   hazel   japan   M
chemist   green   korea   F
barista   green   japan   F
actuary   hazel   italy   F
actuary   green   italy   M
```

The columns are job type, eye color, country, and sex. The overall goal is to predict sex from job, eye color, and country. Suppose you want to predict the sex of a person who is (dentist, hazel, italy). If you look just at the dentists in the job column, 3 of the 7 dentists are male and 4 of the 7 are female, so you’d (weakly) guess female. If you look just at the hazel values in the eye color column, 5 of the 6 people are male and just 1 of the 6 is female, so you’d strongly guess male. If you look just at the italy values in the country column, 2 of the 7 people are male and 5 of the 7 are female, so you’d guess female.

The naive Bayes algorithm combines these frequencies to produce probabilities of male and female. The technique is called naive because it doesn’t take interactions between columns into account.
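To make the combining arithmetic concrete, here is a from-scratch sketch (not part of the scikit demo) that reproduces the prediction for (dentist, hazel, italy). It uses the class counts from the 20-item dataset above (12 female, 8 male) and add-1 (Laplace) smoothing, which is what CategoricalNB does when alpha=1:

```python
# naive Bayes by hand for (dentist, hazel, italy)
# class counts from the data above: 12 female, 8 male

def smoothed(count, class_count, n_values):
  # add-1 (Laplace) smoothed conditional probability
  return (count + 1) / (class_count + n_values)

# job has 4 possible values, eye color 2, country 3
# females: 4 dentists, 1 hazel, 5 italy
# males:   3 dentists, 5 hazel, 2 italy
f_score = (12/20) * smoothed(4, 12, 4) * \
  smoothed(1, 12, 2) * smoothed(5, 12, 3)
m_score = (8/20) * smoothed(3, 8, 4) * \
  smoothed(5, 8, 2) * smoothed(2, 8, 3)

total = f_score + m_score
print(round(f_score / total, 2), round(m_score / total, 2))  # 0.33 0.67
```

The normalized scores match the [0.33, 0.67] probabilities produced by the demo program.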

I coded up this problem using the scikit-learn library. The first (and most time-consuming) step is to convert the categorical values to integers: actuary = 0, barista = 1, chemist = 2, dentist = 3; green = 0, hazel = 1; italy = 0, japan = 1, korea = 2; female = 0, male = 1. In general it’s a good idea to encode using alphabetical order because scikit has an OrdinalEncoder class that encodes that way.

The encoded data looks like:

```
# job_eye_country_sex.txt
# actuary=0, barista=1, chemist=2, dentist=3
# green=0, hazel=1
# italy=0, japan=1, korea=2
# female=0, male=1
#
0   0   2   0
1   0   0   1
3   1   1   1
3   0   0   0
2   1   1   1
0   0   1   0
0   1   1   1
2   0   0   0
2   0   0   0
3   0   1   0
1   1   1   1
3   0   1   0
3   0   1   0
2   0   0   0
3   0   1   1
3   1   1   1
2   0   2   0
1   0   1   0
0   1   0   0
0   0   0   1
```

I saved the encoded data as job_eye_country_sex.txt to be used as training data. I didn’t use any test data as I would in a non-demo scenario.

After reading the data into memory, the key statements to create a naive Bayes classifier are:

```
from sklearn.naive_bayes import CategoricalNB

print("Creating naive Bayes classifier ")
model = CategoricalNB(alpha=1)
model.fit(X, y)
print("Done ")
```

In my demo, the prediction probabilities for (dentist, hazel, italy) are [0.33, 0.67], and so the predicted sex is male because the value at index [1] (0.67) is larger than the value at index [0] (0.33).

Instead of preprocessing the raw data by converting strings to integers, it is possible to programmatically encode raw string data:

```
from sklearn.preprocessing import OrdinalEncoder

print("\nReading raw data using genfromtxt() ")
train_file = ".\\Data\\job_eye_country_sex_raw.txt"
XY = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)

print("\nEncoding the data: ")
enc = OrdinalEncoder(dtype=np.int64)
enc.fit(XY)  # scan data
print("\nCategories: ")
print(enc.categories_)
XY = enc.transform(XY)  # encode data
X = XY[:,0:3]
y = XY[:,3]
# now good to go
```

Naive Bayes is best used for categorical data. If a column is numeric, the values can be bucketed into encoded integers. There is a GaussianNB class that can handle numeric data, but that algorithm makes several assumptions, such as that the data in each column is Gaussian (normal / bell-shaped). I don’t recommend using naive Bayes directly on numeric data.

The 1990s had some very strange but wonderful animated cartoon shows on TV. Here are three of my favorites. All three featured a naive character.

Left: “Aaahh!!! Real Monsters” (1994) features three young but nice monsters: Oblina (like a weird candy cane), Krumm (a hulking but naive monster who holds his eyes in his hands), and Ickis (sort of a demonic rabbit).

Center: “Rocko’s Modern Life” (1993) features the surreal life of an Australian wallaby named Rocko and his friends including a naive steer named Heffer Wolfe, and a neurotic turtle named Filburt.

Right: “CatDog” (1998) features the life of conjoined brothers of different species. The cat half is cynical; the dog half is naive.

Demo code:

```
# naive_bayes.py

# Anaconda3-2020.02  Python 3.7.6
# scikit 0.22.1
# Windows 10/11

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# ---------------------------------------------------------

def main():
  # 0. prepare
  print("\nBegin scikit naive Bayes demo ")
  print("Predict sex (F = 0, M = 1) from job, eye, country ")
  np.random.seed(1)

  # actuary   green   korea   F
  # barista   green   italy   M
  # dentist   hazel   japan   M
  # . . .
  # actuary = 0, barista = 1, chemist = 2, dentist = 3
  # green = 0, hazel = 1
  # italy = 0, japan = 1, korea = 2

  # 1. load data (file is whitespace-delimited, as shown above)
  train_file = ".\\Data\\job_eye_country_sex.txt"
  XY = np.loadtxt(train_file, usecols=range(0,4),
    comments="#", dtype=np.int64)
  X = XY[:,0:3]
  y = XY[:,3]
  # print(y.shape)
  # y = y.flatten()
  # y = y.reshape(-1)
  # y = y.squeeze()
  print("Done ")

  print("\nDiscretized features: ")
  print(X)

  print("\nActual classes: ")
  print(y)

  # 2. create and train model
  print("\nCreating naive Bayes classifier ")
  model = CategoricalNB(alpha=1)
  model.fit(X, y)
  print("Done ")
  pred_classes = model.predict(X)

  # 3. evaluate model
  print("\nPredicted classes: ")
  print(pred_classes)
  acc_train = model.score(X, y)
  print("\nAccuracy on train data = %0.4f " % acc_train)

  # 3b. confusion matrix
  from sklearn.metrics import confusion_matrix
  y_predicteds = model.predict(X)
  cm = confusion_matrix(y, y_predicteds)  # actual, pred
  print("\nConfusion matrix raw: ")
  print(cm)

  # 4. use model
  # dentist, hazel, italy = [3,1,0]
  print("\nPredicting class for dentist, hazel, italy ")
  probs = model.predict_proba([[3,1,0]])
  print("\nPrediction probs: ")
  print(probs)

  predicted = model.predict([[3,1,0]])
  print("\nPredicted class: ")
  print(predicted)

  # 5. TODO: save model using pickle

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
```

## Understanding the PyTorch Linear Layer Default Weight and Bias Initialization

When a PyTorch neural network is instantiated, it’s common practice to use implicit weight and bias initialization. In the case of a Linear layer, the PyTorch documentation is not clear and the source code is surprisingly complicated. I spent several hours experimenting with Linear initialization, and after a lot of work I was able to implement a demo program that uses explicit weight and bias initialization code to get values identical to those produced by the default implicit mechanism. For Linear layers, PyTorch uses what is called the Kaiming (aka He) algorithm.

Note: Kaiming/He initialization is closely related to, but different from, Xavier/Glorot initialization.

The demo network is 3-4-2 (3 input, 4 hidden, 2 output):

```
import math
import torch as T
device = T.device('cpu')

class Net(T.nn.Module):
  def __init__(self, init_type):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(3, 4)  # 3-4-2
    self.oupt = T.nn.Linear(4, 2)

    if init_type == 'default':
      T.manual_seed(1)
      # now calls kaiming_uniform_()
    elif init_type == 'explicit':
      T.manual_seed(1)
      with T.no_grad():  # needed for in-place ops on Parameters
        T.nn.init.kaiming_uniform_(self.hid1.weight,
          a=math.sqrt(5.0))
        bound = 1 / math.sqrt(3.0)  # fan-in = 3
        self.hid1.bias.uniform_(-bound, bound)

        T.nn.init.kaiming_uniform_(self.oupt.weight,
          a=math.sqrt(5.0))
        bound = 1 / math.sqrt(4.0)  # fan-in = 4
        self.oupt.bias.uniform_(-bound, bound)
  . . .
```

So creating a neural network using the default implicit initialization is:

```
net1 = Net('default').to(device)
```

And using explicit initialization is:

```
net2 = Net('explicit').to(device)
```

In either case, the initial values of the weights and biases are the same.

The code is short but very tricky. A big stumbling block was positioning the T.manual_seed(1) calls. Because the torch random number generator is called sequentially several times, the only way to demonstrate identical results is to reset the seed.

Another pitfall was the mysterious math.sqrt(5) value for the equally mysterious “a” parameter. And yet another pitfall was the tensor.uniform_() function which is used for the biases.
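The mystery of a = math.sqrt(5) can be unraveled with a little algebra. According to the PyTorch documentation, kaiming_uniform_ samples weights from U(-bound, bound) where bound = gain * sqrt(3 / fan_in), and the gain for the default leaky_relu nonlinearity is sqrt(2 / (1 + a^2)). Plugging in a = sqrt(5) collapses the bound to 1 / sqrt(fan_in), the classic default Linear weight range. A quick check of that algebra:

```python
import math

# kaiming_uniform_ bound per the PyTorch docs:
#   bound = gain * sqrt(3 / fan_in), gain = sqrt(2 / (1 + a^2))
fan_in = 3                      # the hid1 layer has 3 inputs
a = math.sqrt(5.0)
gain = math.sqrt(2.0 / (1.0 + a * a))
bound = gain * math.sqrt(3.0 / fan_in)

print(abs(bound - 1.0 / math.sqrt(fan_in)) < 1.0e-12)  # True
```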

Note: This blog post is essentially a follow-up to an earlier blog post where I made a careless mistake: https://jamesmccaffrey.wordpress.com/2022/01/06/pytorch-explicit-vs-implicit-weight-and-bias-initialization/.

I like PyTorch a lot, but the weight and bias initialization code and documentation are a weak point. By the way, Kaiming initialization was devised specifically for very deep convolutional networks. I’m not sure why the designers of PyTorch decided to use Kaiming as the default for Linear layers.

In my earlier blog post, my mistake was a simple misspelling. Misspellings can happen in sports too. Left: From a college football game at the University of Southern California. The USC slogan is “Fight On!” (I did my grad work at USC). Right: From an NFL professional football game for the New York Jets. I’m not kidding.

Demo code:

```
# explore_nn_init.py
# weight and bias init investigation
# PyTorch 1.12.1+cpu Anaconda3-2020.02  Python 3.7.6
# Windows 10/11

# -----------------------------------------------------------
# 1. possible to get same results as init.kaiming_uniform_ ?
# -----------------------------------------------------------

import math
import torch as T
device = T.device('cpu')  # apply to Tensor or Module

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self, init_type):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(3, 4)  # 3-4-2
    self.oupt = T.nn.Linear(4, 2)

    if init_type == 'default':
      T.manual_seed(1)
      # now calls kaiming_uniform_()
    elif init_type == 'explicit':
      T.manual_seed(1)
      with T.no_grad():  # needed for in-place ops on Parameters
        T.nn.init.kaiming_uniform_(self.hid1.weight,
          a=math.sqrt(5.0))
        bound = 1 / math.sqrt(3.0)  # fan-in = 3
        self.hid1.bias.uniform_(-bound, bound)

        T.nn.init.kaiming_uniform_(self.oupt.weight,
          a=math.sqrt(5.0))
        bound = 1 / math.sqrt(4.0)  # fan-in = 4
        self.oupt.bias.uniform_(-bound, bound)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = self.oupt(z)  # no activation: CrossEntropyLoss()
    return z

# -----------------------------------------------------------

def main():
  print("\nBegin Linear layer init demo ")
  T.manual_seed(1)

  print("\n==================== ")
  print("\nCreating a 3-4-2 network default init ")
  net1 = Net('default').to(device)

  print("\nhid1 wts and biases: ")
  print(net1.hid1.weight.data)
  print(net1.hid1.bias.data)

  print("\noupt wts and biases: ")
  print(net1.oupt.weight.data)
  print(net1.oupt.bias.data)

  print("\n==================== ")
  print("\nCreating a 3-4-2 network explicit init ")
  net2 = Net('explicit').to(device)

  print("\nhid1 wts and biases: ")
  print(net2.hid1.weight.data)
  print(net2.hid1.bias.data)

  print("\noupt wts and biases: ")
  print(net2.oupt.weight.data)
  print(net2.oupt.bias.data)

  print("\n==================== ")
  print("\nEnd initialization demo ")

if __name__ == "__main__":
  main()
```

## A Simplified Version of the scikit Library make_circles() Function

I was looking at spectral clustering with the scikit-learn library. Standard k-means clustering doesn’t work well for data that has weird geometry. A standard example is data that when graphed looks like two concentric circles. Spectral clustering connects data into a virtual graph which allows it to deal with weird geometry. I wanted to run some experiments with spectral clustering and so I needed some data to work with. The scikit library has a make_circles() function that is intended for clustering experiments. I don’t like dependencies, so I figured I’d go to the scikit source code and fetch the code for the make_circles() function. How complicated could it be?

Agh! I quickly discovered that the scikit make_circles() function was very complicated. It had dependencies which had dependencies which had yet more dependencies. There were at least 400 lines of code.

I was engineering-annoyed and decided to pull just the essential code and implement a simplified version. All I needed was approximately 15 lines of code:

```
def my_make_circles(n_samples=100, factor=0.8,
    noise=None, seed=1):
  rnd = np.random.RandomState(seed)
  n_samples_out = n_samples // 2
  n_samples_in = n_samples - n_samples_out

  lin_out = np.linspace(0, 2 * np.pi, n_samples_out,
    endpoint=False)
  lin_in = np.linspace(0, 2 * np.pi, n_samples_in,
    endpoint=False)
  outer_circ_x = np.cos(lin_out)
  outer_circ_y = np.sin(lin_out)
  inner_circ_x = np.cos(lin_in) * factor
  inner_circ_y = np.sin(lin_in) * factor

  X = np.vstack(
    [np.append(outer_circ_x, inner_circ_x),
     np.append(outer_circ_y, inner_circ_y)]).T
  y = np.hstack(
    [np.zeros(n_samples_out, dtype=np.int64),
     np.ones(n_samples_in, dtype=np.int64)])

  if noise is not None:
    X += rnd.normal(loc=0.0, scale=noise, size=X.shape)

  return X, y
```

I wrote a demo. The key calling statement is:

```
data, labels = my_make_circles(n_samples=20,
  factor=0.5, noise=0.08, seed=0)
```

The return value is a tuple holding “data” and “labels”. The “data” item has 20 rows, each with an x coordinate and a y coordinate: the first 10 rows are the outer circle, and the second 10 rows are the inner circle. The “labels” item has 10 zeros followed by 10 ones.

The factor parameter should be between 0.0 and 1.0 and controls how small the inner circle is: smaller values make a smaller inner circle. The noise parameter is the standard deviation of the zero-mean Gaussian noise that is added to each point. A value of 0.0 gives perfect circles; larger values give more scattered circular shapes.

This example points out that library functions are almost always larger than from-scratch implementations. Because library functions can be used in so many ways, they need many parameters to handle all the different ways the function might be called, a lot of error-checking code, and a lot of extra code to make all the library functions work with each other.

Good fun. Three interesting portraits with a circular theme. Left: By famous art nouveau illustrator Alphonse Mucha (1860-1939). Center: By contemporary artist Manuel Nunez. Right: By contemporary artist Karol Bak.

Demo code:

```
# spectral_cluster_scikit.py

import numpy as np
import matplotlib.pyplot as plt

# ---------------------------------------------------------

def my_make_circles(n_samples=100, factor=0.8,
    noise=None, seed=1):
  rnd = np.random.RandomState(seed)
  n_samples_out = n_samples // 2
  n_samples_in = n_samples - n_samples_out

  lin_out = np.linspace(0, 2 * np.pi, n_samples_out,
    endpoint=False)
  lin_in = np.linspace(0, 2 * np.pi, n_samples_in,
    endpoint=False)
  outer_circ_x = np.cos(lin_out)
  outer_circ_y = np.sin(lin_out)
  inner_circ_x = np.cos(lin_in) * factor
  inner_circ_y = np.sin(lin_in) * factor

  X = np.vstack(
    [np.append(outer_circ_x, inner_circ_x),
     np.append(outer_circ_y, inner_circ_y)]).T
  y = np.hstack(
    [np.zeros(n_samples_out, dtype=np.int64),
     np.ones(n_samples_in, dtype=np.int64)])

  if noise is not None:
    X += rnd.normal(loc=0.0, scale=noise, size=X.shape)

  return X, y

# ---------------------------------------------------------

def main():
  print("\nBegin simplified make_circles() demo ")

  data, labels = my_make_circles(n_samples=20,
    factor=0.5, noise=0.08, seed=0)

  print("\ndata = ")
  print(data)
  print("\nlabels = ")
  print(labels)

  outer = data[0:10,:]
  inner = data[10:20,:]

  plt.scatter(outer[:,0], outer[:,1])
  plt.scatter(inner[:,0], inner[:,1])
  plt.show()

  print("\nEnd ")

if __name__ == "__main__":
  main()
```

## “Researchers Evaluate the Top Four AI Stories of 2022” on the Pure AI Web Site

I contributed to an article titled “Researchers Evaluate the Top Four AI Stories of 2022” in the January 2023 edition of the Pure AI web site. See https://pureai.com/articles/2023/01/05/top-ai-stories-of-2022.aspx. I am a regular contributing editor for the Pure AI site. For this article, I collaborated with two other AI experts and we reviewed AI/ML news stories from 2022. We ultimately agreed that four significant stories were:

1.) The release of the DALL-E 2 system to generate artificial images.

2.) The development of the AlphaFold system to predict protein structure.

3.) The development of the Cicero system to play master-level Diplomacy.

4.) The release of ChatGPT to answer arbitrary text-based questions.

Images generated by DALL-E from the prompt, “A painting of a fox sitting in a field at sunrise in the style of Claude Monet”. OK, but how can this generate significant revenue?

Of these four, the two that I had the strongest opinions on were AlphaFold and ChatGPT. I was impressed with AlphaFold and said that it’s possible the system could lead to huge advances in biology. I was not impressed with ChatGPT, feeling it’s over-hyped. The main problem with ChatGPT is that there’s little or no control over the source of responses to a question like, “Why did Russia invade Ukraine?”

Research shows that women chat much more than men in social scenarios. There are surprisingly few chatty women robots in science fiction movies. Left: Ava is an advanced experiment in “Ex Machina” (2014). Center: Rachel is an administrative assistant in “Blade Runner” (1982). Right: Arlette is a harlot in “Westworld” (1973).

## What Are Correct Values for Precision and Recall When the Denominators Are Zero?

I did an Internet search for “What are correct values for precision and recall when the denominators equal 0?” and was pointed to a StackExchange page which had been up for over 11 years — and which was somewhat ambiguous. See https://stats.stackexchange.com/questions/8025/what-are-correct-values-for-precision-and-recall-when-the-denominators-equal-0.

The source of the issue is the definitions of FP (false positives) and FN (false negatives).

Years ago, I was taught that TP (true positives) are actual positive (1) items that are predicted correctly. FP (false positives) are actual positive (1) items that are predicted incorrectly. TN (true negatives) are actual negative (0) items that are predicted correctly. FN (false negatives) are actual negative (0) items that are predicted incorrectly. These definitions make sense from an English point of view.

However, a more common definition nowadays is that FP is an actual negative that’s predicted incorrectly, and FN is an actual positive that’s predicted incorrectly.

Precision and recall are defined as:

```
p (precision) = TP / (TP + FP)
r (recall)    = TP / (TP + FN)
```

Here’s an example using the modern, more common definitions:

```
actual  predicted
0       0        TN
0       0        TN
0       1        FP
0       1        FP
1       0        FN
1       1        TP
1       1        TP
1       1        TP

p = TP / (TP + FP) = 3 / (3+2) = 3/5

r = TP / (TP + FN) = 3 / (3+1) = 3/4
```

OK. But what if either denominator is 0? For precision, if TP + FP = 0, then TP = 0 and also FP = 0. The only way this can happen is if every prediction is negative (0), for example:

```
actual  predicted
0       0        TN
0       0        TN
0       0        TN
0       0        TN
1       0        FN
1       0        FN
1       0        FN
1       0        FN
```

In other words, all predictions are negative. This should raise a warning that the prediction system could be broken. The StackExchange page states, “If (true positives + false positives) = 0 then all cases have been predicted to be negative.” This is true if the more common definitions of FP and FN are used.

Now for recall, if TP + FN = 0, then TP = 0 and also FN = 0. This could happen like this:

```
actual  predicted
0       0        TN
0       0        TN
0       1        FP
0       1        FP
0       0        TN
0       0        TN
0       0        TN
0       0        TN
```

All actual labels are negative. If any actual item were positive, then there’d be at least one TP (if predicted correctly) or one FN (if predicted incorrectly). This scenario (no positive items in the data) should never happen. The StackExchange page states, “If (true positives + false negatives) = 0 then no positive cases in the input data.”

Bottom line:

Using the common definitions of FP and FN:

1.) If TP + FP = 0, a warning should be printed that the prediction system is likely flawed because it always predicts class negative.

2.) If TP + FN = 0, a warning should be printed that the data is flawed because there are no actual positives.
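These two rules can be captured in a small guard function. This is just a sketch of one possible convention (returning None for an undefined value), not standard library behavior; scikit-learn, for example, returns 0.0 and issues a warning in these cases:

```python
def precision_recall(tp, fp, fn):
  # returns (precision, recall); None flags an undefined value
  p = tp / (tp + fp) if (tp + fp) > 0 else None  # None: model never predicts positive
  r = tp / (tp + fn) if (tp + fn) > 0 else None  # None: no actual positives in data
  return (p, r)

print(precision_recall(3, 2, 1))  # (0.6, 0.75) -- the worked example above
print(precision_recall(0, 0, 4))  # (None, 0.0) -- all predictions negative
```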

Do not believe everything you read on the Internet without questioning it.

I’m not sure why, but I always associate the word “precision” with high quality wrist watches. Here are three watches that don’t immediately bring the idea of precision to mind. Left: This one is a-mazing. Center: I don’t know why all wrist watches don’t have built-in lighters. What could go wrong? Right: Very stylish, but the vacuum tubes might not be necessary.


## NFL 2022 Season Super Bowl LVII Prediction – Zoltar Predicts the Chiefs Will Beat the Eagles

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #22 (Super Bowl LVII) of the 2022 season.

```
Zoltar:      chiefs  by    3  dog =     eagles    Vegas:      eagles  by  2.0
```

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. For this season I’ve been using a threshold of 4 points difference but in some previous seasons I used 3 points. For Super Bowl LVII (week #22) Zoltar thinks the Kansas City Chiefs are 3 points better than the Philadelphia Eagles. Las Vegas thinks the Eagles are 2.0 points better than the Chiefs. Therefore, the difference of opinion is 5 points and so Zoltar suggests a wager on the Chiefs.

A bet on the Chiefs will pay off if the Chiefs win by any score, or if the Eagles win by less than 2.0 points (i.e., 1 point). If the Eagles win by exactly 2 points, the bet is a push.

Theoretically, if you must bet \$110 to win \$100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
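The 53% figure follows from the \$110/\$100 odds. If p is the fraction of bets won, profit per game is p * 100 - (1 - p) * 110, which is zero at p = 110/210:

```python
# breakeven accuracy when risking $110 to win $100:
# p*100 = (1-p)*110  =>  p = 110/210
breakeven = 110 / 210
print(round(breakeven, 4))  # 0.5238
```

So anything above roughly 52.4% is theoretically profitable, which rounds to the 53% rule of thumb.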

In week #21, against the Vegas point spread, Zoltar went 0-0 using 4.0 points as the advice threshold because Zoltar’s two predictions (49ers at Eagles, and Bengals at Chiefs) both agreed closely with Vegas predictions.

For the season, against the spread, Zoltar is 57-31 (~64% accuracy).

Just for fun, I track how well Zoltar does when just trying to predict which team will win a game. This isn’t useful except for parlay betting. In week #21, just predicting the winning team, Zoltar went 2-0. Vegas also went 2-0 at just predicting the winning team.

My prediction system is named after the Zoltar fortune teller machine you can find in arcades. Arcade Zoltar is named after the Zoltar machine that was featured in the 1988 movie “Big”, where the machine was magical and granted a boy’s wish to become an adult. The movie Zoltar was named after a 1960s era fortune teller machine named Zoltan.

## “Logistic Regression from Scratch Using Raw Python” in Visual Studio Magazine

I wrote an article titled “Logistic Regression from Scratch Using Raw Python” in the January 2023 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2023/01/18/logistic-regression.aspx. Logistic regression is a machine learning technique for binary classification. For example, you might want to predict the sex of a person (male or female) based on their age, state where they live, income and political leaning. There are many other techniques for binary classification, but logistic regression was one of the earliest developed and the technique is considered a fundamental machine learning skill for data scientists.

The article presents a complete end-to-end demo. The goal is to predict a person’s sex based on age, State, income, and political leaning. The data looks like:

```
1   0.24   1   0   0   0.2950   0   0   1
0   0.39   0   0   1   0.5120   0   1   0
1   0.63   0   1   0   0.7580   1   0   0
0   0.36   1   0   0   0.4450   0   1   0
1   0.27   0   1   0   0.2860   0   0   1
. . .
```

The tab-delimited fields are sex (0 = male, 1 = female), age (divided by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), income (divided by \$100,000) and political leaning (conservative = 100, moderate = 010, liberal = 001).
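For example, using that encoding, a 36-year-old male political moderate from Michigan who makes \$52,000 per year would be represented as follows (a sketch; this particular person is made up, not a row from the article’s dataset):

```python
# sex male = 0; age 36 -> 0.36; Michigan -> 1 0 0;
# income $52,000 -> 0.5200; moderate -> 0 1 0
row = [0, 0.36, 1, 0, 0, 0.5200, 0, 1, 0]
print(len(row))  # 9 values: 1 label + 8 predictors
```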

The structure of the demo program is:

```
# people_gender_log_reg.py
# Anaconda3-2020.02  Python 3.7.6
# Windows 10/11

import numpy as np

# -----------------------------------------------------------

def compute_output(w, b, x): . . .
def accuracy(w, b, data_x, data_y): . . .
def mse_loss(w, b, data_x, data_y): . . .

# -----------------------------------------------------------

def main():
  print("Begin logistic regression with raw Python demo ")
  np.random.seed(1)

  # 1. load data into memory
  # 2. create model
  # 3. train model
  # 4. evaluate model
  # 5. use model

  print("End People logistic regression demo ")

if __name__ == "__main__":
  main()
```
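To give a flavor of the implementation, compute_output() is just the logistic sigmoid applied to a weighted sum of the inputs. A minimal sketch (the weight, bias, and input values here are made-up illustrative numbers, not from the article):

```python
import math

def compute_output(w, b, x):
  # logistic regression: p = sigmoid(w . x + b)
  z = sum(wi * xi for wi, xi in zip(w, x)) + b
  return 1.0 / (1.0 + math.exp(-z))

# z = (0.5)(1.0) + (-0.5)(1.0) + 0.1 = 0.1
print(round(compute_output([0.5, -0.5], 0.1, [1.0, 1.0]), 4))  # 0.525
```

A computed value greater than 0.5 is interpreted as a prediction of class 1 (female in the demo encoding).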

Logistic regression is relatively easy to implement — so much so that a from-scratch version is easier for me than using a logistic regression module from a library like ML.NET or scikit-learn.

School employees create signage from scratch, but sometimes they’d be better off getting someone else to do the message spelling.

## Solving the Traveling Salesman Problem (TSP) Using an Epsilon-Greedy Algorithm

An epsilon-greedy algorithm is a general approach that can be used for many different problems. I recently devised a nice evolutionary algorithm for the Traveling Salesman Problem (TSP) that seems to work very well. Just for fun, I spent one weekend morning implementing an epsilon-greedy algorithm for TSP. Briefly: the epsilon-greedy approach did not work as well as my evolutionary algorithm.

In high-level pseudo-code, an epsilon-greedy algorithm for TSP looks like:

```
create a random guess solution
loop many times
  use curr guess to generate a candidate guess
  if candidate is best seen so far, save it
  generate an epsilon in [0, 1]
  if epsilon less-than small value:
    update curr guess (accept possibly worse candidate)
  else if candidate is better than curr:
    update curr guess (greedy part)
  else:
    no change to curr guess
end-loop
return best guess found
```

The value of epsilon must be determined by trial and error. Typical values are things like 0.01 and 0.05.

A non-technical issue here is that, like most of the machine learning engineers and scientists I know, once I latch onto a problem like TSP I find it difficult to avoid trying multiple algorithms, even in cases where I’m quite sure some algorithms won’t work well.

In computer science, “epsilon” usually refers to a very small value. Here is an example of a small processor from IBM hardware research. The processor is about the size of a grain of salt, and has roughly the power of a 1990s CPU.

Demo code:

```
# tsp_epsilon_greedy.py

import numpy as np

# -----------------------------------------------------------

class Solver():
  def __init__(self, num_cities, distances, seed):
    self.num_cities = num_cities
    self.rnd = np.random.RandomState(seed)
    self.distances = distances  # by ref

    self.curr_soln = np.arange(num_cities, dtype=np.int64)
    self.rnd.shuffle(self.curr_soln)
    self.curr_err = self.compute_error(self.curr_soln)

    self.best_soln = np.copy(self.curr_soln)  # soln
    self.best_err = self.curr_err             # err

  def show(self):
    # print("Best solution: ")
    print(self.best_soln, end="")
    print(" | %0.4f " % self.best_err)

  def compute_error(self, soln):
    n = self.num_cities
    d = 0.0
    for i in range(n-1):  # note: no return leg to start city
      from_idx = soln[i]
      to_idx = soln[i+1]
      d += self.distances[from_idx][to_idx]
    return d

  def solve(self, max_iter):
    for iter in range(max_iter):
      # 1. pick two indices
      idx1 = self.rnd.randint(0, self.num_cities)
      idx2 = self.rnd.randint(0, self.num_cities)
      # print(idx1); print(idx2); input()

      # 2. create an offspring/candidate by swapping
      candidate = np.copy(self.curr_soln)
      tmp = candidate[idx1]
      candidate[idx1] = candidate[idx2]
      candidate[idx2] = tmp
      cand_err = self.compute_error(candidate)
      # print(candidate); print(cand_err); input()

      # 3. is candidate new best?
      if cand_err < self.best_err:
        self.best_soln = np.copy(candidate)
        self.best_err = cand_err

      # 4. if candidate better than curr, replace
      epsilon = self.rnd.random()
      # print(epsilon)
      if epsilon < 0.01:
        self.curr_soln = np.copy(candidate)
        self.curr_err = cand_err
      elif cand_err < self.curr_err:
        # print("new curr soln ")
        self.curr_soln = np.copy(candidate)
        self.curr_err = cand_err
      else:
        pass
        # no change to curr soln
        # print("no change to curr soln ")

# -----------------------------------------------------------

def get_distances(n_cities):
  # from text file in non-demo scenario
  result = np.zeros((n_cities, n_cities), dtype=np.float32)
  for i in range(n_cities):
    for j in range(n_cities):
      if i < j: result[i][j] = (j - i) * 1.0
      elif j < i: result[i][j] = (i - j) * 1.5
  return result

# -----------------------------------------------------------

def main():
  print("\nBegin TSP using epsilon-greedy ")
  print("Setting TSP n = 20 cities ")
  print("Note: 20! = 2,432,902,008,176,640,000 ")
  print("Optimal soln is 0 1 2 . . 19 with dist = 19.0 ")

  num_cities = 20
  distances = get_distances(num_cities)

  solver = Solver(num_cities, distances, seed=1)
  print("\nInitial guess: ")
  solver.show()

  print("\nBegin search ")
  solver.solve(10000)
  print("Done ")

  print("\nFinal best estimate: ")
  solver.show()

  print("\nEnd demo ")

# -----------------------------------------------------------

if __name__ == "__main__":
  main()
```

## Another Look at GPT-3 / Codex / GitHub Copilot – I Have Mixed Opinions

GPT-3 (“Generative Pre-trained Transformer”) is a large language model that can generate text, such as a response to, “Write two paragraphs about the history of computer programming.” GPT-3 was trained on an enormous corpus of text — Wikipedia, books, blogs, and so on. GPT-3 is owned by OpenAI with major funding from Microsoft. ChatGPT is a chatbot based on GPT-3.

Codex is an extension of GPT-3 that received additional training on billions of lines of computer code. So, in addition to natural language, Codex understands computer code in several languages such as Python, JavaScript, C#, C++, HTML, CSS, SQL, and others.

You use GPT-3 and Codex by typing queries into a text box. A Codex query might be, “Show me a C# function for binary search.” The result would be code that you can copy-paste.

GitHub Copilot is essentially a wrapper over Codex that’s integrated into a development environment such as Microsoft Visual Studio. This is more efficient because you get the Codex results directly in a program instead of having to copy-paste. A downside is that, after a free trial period of 60 days, you must pay \$10 per month for Copilot (currently — by the time you read this, the cost might be different). A long time ago, I experimented with a pre-release version of Copilot. It had quite a few bugs and glitches. I figured I’d investigate the released version.

Because I use Visual Studio for most of my C# development, I decided to install Copilot on VS. Installation directions for Copilot at https://docs.github.com/en/copilot/quickstart were very good and I had no major trouble getting Copilot installed. Briefly, I logged onto my GitHub account, went to Settings, and requested access to Copilot. After giving my credit card info, GitHub asked for an authorization code. I opened up VS and created a dummy project and then added the GitHub Copilot Extension from the Microsoft Store (rather than from GitHub as I had first expected, because there is billing involved). VS then gave me an 8-character authorization code which I placed into GitHub. I restarted VS and voila, Copilot was installed in VS.

Using Copilot in VS is quite a remarkable experience. Briefly, you enter a comment and then hit enter. I used a dummy Console Application (shell) C# project. I typed the comment:

```// show code for a list of color items
```

and got:

```public class ColorItem
{
  public string Color { get; set; }
  public string Name { get; set; }
}
```

Cool. Then I typed:

```// create a list of color items
```

I got the following. Notice Copilot knew the current conversation context.

```List<ColorItem> colorItems = new List<ColorItem>
{
  new ColorItem { Color = "Red", Name = "Red" },
  new ColorItem { Color = "Green", Name = "Green" },
  new ColorItem { Color = "Blue", Name = "Blue" }
};
```

Next I typed:

```// now function to print the list
```

and got:

```public void PrintColorItems()
{
  foreach (var item in colorItems)
  {
    Console.WriteLine(item.Color + " " + item.Name);
  }
}
```

I have mixed opinions about Codex / Copilot. Copilot is essentially a more efficient alternative to issuing queries to Google and then reading through blog posts, Stack Overflow answers, and so on. But the inefficiency of the Google approach is useful, because the process simultaneously teaches me and lodges the information in my memory. Copilot just gives an answer, with no active learning going on, and so my skills don’t grow.

My conclusion is that Copilot (and Codex) is probably good for simple programming tasks such as generating CSS, SQL, or common C# code like an array binary search function. But Copilot is probably not good for advanced algorithms such as those I use in my machine learning work.

## Binary Classification Using a scikit Decision Tree

I hadn’t looked at using a decision tree from the scikit-learn (scikit for short) library for several months, so I figured I’d do an example. Before I go any further: I am not a big fan of decision trees, and this example reinforces my opinion.

I used one of my standard datasets for binary classification. The data looks like:

```1   0.24   1 0 0   0.2950   0 0 1
0   0.39   0 0 1   0.5120   0 1 0
1   0.63   0 1 0   0.7580   1 0 0
0   0.36   1 0 0   0.4450   0 1 0
. . .
```

Each line of data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (conservative = 100, moderate = 010, liberal = 001). The goal is to predict the sex of a person from their age, state, income, and politics type. The data can be found at: https://jamesmccaffrey.wordpress.com/2022/09/23/binary-classification-using-pytorch-1-12-1-on-windows-10-11/. It isn’t necessary to normalize age and income for a decision tree, because each split examines just one feature at a time. Converting categorical predictors like state is conceptually tricky, but the bottom line is that in most scenarios it’s best to one-hot encode rather than use integer label encoding. Ordinal data like low = 0, medium = 1, high = 2 can be ordinal-encoded.
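As a side note, the one-hot encoding step can be done programmatically with scikit’s OneHotEncoder class. This is just an illustrative sketch with the three state names (the demo data file itself was encoded by hand):

```python
# sketch: one-hot encode a categorical column with scikit
# (illustrative only -- the demo data file was encoded manually)
import numpy as np
from sklearn.preprocessing import OneHotEncoder

states = np.array([["Michigan"], ["Nebraska"], ["Oklahoma"], ["Michigan"]])
enc = OneHotEncoder()  # categories are ordered alphabetically
encoded = enc.fit_transform(states).toarray()  # dense result

print(enc.categories_)  # [array(['Michigan', 'Nebraska', 'Oklahoma'] ...)]
print(encoded)          # Michigan -> [1,0,0], Nebraska -> [0,1,0], etc.
```

Because OneHotEncoder orders categories alphabetically, Michigan maps to 100, which matches the hand-encoded demo data.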

The scikit documentation states that for binary classification, the variable to predict should be encoded as -1 and +1. However, one of the documentation examples uses 0 and 1 which is more consistent with other binary classification algorithms. In fact, it’s possible to use text labels, such as “M” and “F”, for the target class too. The scikit documentation has quite a few inconsistencies like this.
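To show that text labels work, here is a minimal sketch with tiny made-up data (one age predictor, not the demo file):

```python
# sketch: scikit tree classifiers accept text class labels directly
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.24], [0.39], [0.63], [0.36]])  # ages / 100
y = np.array(["F", "M", "M", "F"])              # text targets

model = DecisionTreeClassifier(max_depth=2, random_state=1)
model.fit(X, y)

print(model.classes_)           # labels kept as text: ['F' 'M']
print(model.predict([[0.40]]))  # prediction is a text label too
```

The fitted model stores the distinct labels in sorted order in its classes_ attribute, and predict() returns text labels rather than integers.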

The key lines of code are:

```md = 4  # depth
print("Creating decision tree with max_depth=" + str(md))
model = tree.DecisionTreeClassifier(max_depth=md,
  random_state=1)
model.fit(train_x, train_y)
print("Done ")
```

Decision trees are highly sensitive to overfitting. If you set a large max_depth you can get 100% classification on training data but the accuracy on test data and new previously unseen data will likely be very poor.
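To make the overfitting effect concrete, here is a sketch on synthetic data (scikit’s make_classification with 10% label noise, not the demo’s people data), comparing an unlimited-depth tree against a depth-limited one:

```python
# sketch: an unlimited-depth tree memorizes its training data;
# compare train vs. test accuracy against a depth-limited tree
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8,
  n_informative=3, flip_y=0.10, random_state=1)  # 10% noisy labels
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y,
  test_size=0.25, random_state=1)

deep = DecisionTreeClassifier(random_state=1)  # no max_depth limit
deep.fit(X_trn, y_trn)
shallow = DecisionTreeClassifier(max_depth=4, random_state=1)
shallow.fit(X_trn, y_trn)

print("deep:    train %0.4f  test %0.4f" %
  (deep.score(X_trn, y_trn), deep.score(X_tst, y_tst)))
print("shallow: train %0.4f  test %0.4f" %
  (shallow.score(X_trn, y_trn), shallow.score(X_tst, y_tst)))
```

The unlimited tree scores 100% on the training data because it keeps splitting until every training item is classified correctly; with noisy labels, the depth-limited tree typically holds up better on the test data.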

I wrote a custom tree_to_pseudo() function that displays a tree in text format. Note: I kludged the function together from examples I found on the Internet — the code is very, very tricky. The first part of the output of tree_to_pseudo() looks like:

```if ( income <= 0.3400 ) {
  if ( pol2 <= 0.5 ) {
    if ( age <= 0.235 ) {
      if ( income <= 0.2815 ) {
        return 1.0
      } else {
        return 0.0
      }
    } else {
      return 1.0
    }
  } else {
    return 1.0
  }
} else {
. . .
```

There is also a built-in tree.export_text() function that gives similar results:

```|--- income <= 0.34
|   |--- pol2 <= 0.50
|   |   |--- age <= 0.23
|   |   |   |--- income <= 0.28
|   |   |   |   |--- class: 1.0
. . .
```

Anyway, the demo was a good refresher for me. A few years ago some researchers in the UK did an experiment where they determined that people can identify the gender of a person solely by how they walk. Fashion models extend this idea to exaggerated walking styles.

Demo code below. The data can be found at jamesmccaffrey.wordpress.com/2022/09/23/binary-classification-using-pytorch-1-12-1-on-windows-10-11/.

```# people_gender_tree.py

# predict gender (0 = male, 1 = female)
# from age, state, income, politics-type

# data:
#  0   0.39   0   0   1   0.5120   0   1   0
#  1   0.27   0   1   0   0.2860   1   0   0
# . . .

# Anaconda3-2020.02  Python 3.7.6
# Windows 10/11  scikit 0.22.1

import numpy as np
from sklearn import tree

# ---------------------------------------------------------

def tree_to_pseudo(tree, feature_names):
  # note: the parameter "tree" (the trained model) shadows the
  # imported sklearn tree module inside this function
  left = tree.tree_.children_left
  right = tree.tree_.children_right
  threshold = tree.tree_.threshold
  features = [feature_names[i] for i in tree.tree_.feature]
  value = tree.tree_.value

  def recurse(left, right, threshold, features, node, depth=0):
    indent = "  " * depth
    if (threshold[node] != -2):  # -2 marks a leaf node
      print(indent,"if ( " + features[node] + " <= " + \
        str(threshold[node]) + " ) {")
      if left[node] != -1:
        recurse(left, right, threshold, features, \
          left[node], depth+1)
      print(indent,"} else {")
      if right[node] != -1:
        recurse(left, right, threshold, features, \
          right[node], depth+1)
      print(indent,"}")
    else:
      idx = np.argmax(value[node])
      # print(indent,"return " + str(value[node]))
      print(indent,"return " + str(tree.classes_[idx]))

  recurse(left, right, threshold, features, 0)

# ---------------------------------------------------------

def main():
  # 0. begin
  print("\nBegin scikit decision tree example ")
  print("Predict sex from age, state, income, politics ")
  np.random.seed(0)

  # 1. load data (tab-delimited file)
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,9),
    delimiter="\t", comments="#", dtype=np.float32)
  train_x = train_xy[:,1:9]
  train_y = train_xy[:,0]

  test_file = ".\\Data\\people_test.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,9),
    delimiter="\t", comments="#", dtype=np.float32)
  test_x = test_xy[:,1:9]
  test_y = test_xy[:,0]

  np.set_printoptions(suppress=True)
  print("\nTraining data:")
  print(train_x[0:4])
  print(". . . \n")
  print(train_y[0:4])
  print(". . . ")

  # 2. create and train
  md = 4
  print("\nCreating decision tree max_depth=" + str(md))
  model = tree.DecisionTreeClassifier(max_depth=md,
    random_state=1)
  model.fit(train_x, train_y)
  print("Done ")

  # 3. evaluate
  acc_train = model.score(train_x, train_y)
  print("\nAccuracy on train = %0.4f " % acc_train)
  acc_test = model.score(test_x, test_y)
  print("Accuracy on test = %0.4f " % acc_test)

  # 3b. use model (note: predict() expects a 2-D array)
  # print("\nPredict age 36, Oklahoma, $50K, moderate ")
  # x = np.array([[0.36, 0,0,1, 0.5000, 0,1,0]],
  #   dtype=np.float32)
  # predicted = model.predict(x)
  # print(predicted)

  # 4. visualize
  print("\nTree in pseudo-code: ")
  tree_to_pseudo(model, ["age", "state0", "state1", "state2",
    "income", "pol0", "pol1", "pol2"])

  # 4b. use built-in export_text()
  # recall: from sklearn import tree
  pseudo = tree.export_text(model, ["age", "state0", "state1",
    "state2", "income", "pol0", "pol1", "pol2"])
  print(pseudo)

  # 4c. use built-in plot_tree()
  import matplotlib.pyplot as plt
  tree.plot_tree(model, feature_names=["age", "state0",
    "state1", "state2", "income", "pol0", "pol1", "pol2"],
    class_names=["male", "female"])
  plt.show()

  # 5. TODO: save model using pickle

if __name__ == "__main__":
  main()
```
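For step 5 of the demo, here is a minimal pickle sketch. The file name people_tree_model.pkl is my choice, and a tiny stand-in model is trained so the sketch is self-contained (the demo program would dump its own trained model object instead):

```python
# sketch for step 5: save a trained tree to disk with pickle,
# then reload it and verify it predicts the same way
import pickle
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# tiny stand-in model; the demo would pickle its trained `model`
X = np.array([[0.24], [0.39], [0.63], [0.36]])
y = np.array([1, 0, 0, 1])
model = DecisionTreeClassifier(max_depth=2, random_state=1)
model.fit(X, y)

with open("people_tree_model.pkl", "wb") as f:  # save
  pickle.dump(model, f)

with open("people_tree_model.pkl", "rb") as f:  # load
  loaded = pickle.load(f)

print(loaded.predict([[0.40]]))  # matches the original model's prediction
```

One caution: a pickled scikit model should be reloaded with the same scikit version that created it, because the internal object layout can change between versions.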