## Gaussian Process Classification on the Wheat Seeds Dataset Using the scikit Library

A classification problem is one where the goal is to predict a single categorical value. For example, you might want to predict sex of a person (0 = male, 1 = female) based on age, income, and so on (a binary classification problem). Or you might want to predict the species of a wheat seed (0 = Kama, 1 = Rosa, 2 = Canadian) based on predictors like seed length, width, and so on (a multi-class problem).

There are many completely different techniques for classification. Examples include logistic regression (binary classification), naive Bayes (different versions depending on the data types of the predictor variables), decision trees (many variations such as AdaBoost), and neural networks.

A relatively rare technique for classification (among my colleagues at least) is Gaussian process classification (GPC). When GPC works, it often works very well, but when GPC fails, it often fails spectacularly.

I put together a brief demo using the scikit GaussianProcessClassifier module. For my demo I used the Wheat Seeds Dataset. There are 210 source data items, each representing one of three species of wheat seeds. There are 7 predictor variables.

As is usually the case, preparing the data was time-consuming and took far longer than building the machine learning model. I used divide-by-k normalization on the 7 predictors. The divide constants I used are (25, 20, 1, 10, 10, 10, 10) so that all predictors are between 0.0 and 1.0. After normalizing the 210 data items, I used the first 60 of each class/species for a 180-item training dataset, and the last 10 of each class/species as a 30-item test dataset. The resulting data looks like:

```0.6104  0.7420  0.8710  0.5763  0.3312  0.2221  0.5220  0
0.5952  0.7285  0.8811  0.5554  0.3333  0.1018  0.4956  0
. . .
0.7052  0.7990  0.8673  0.6191  0.3561  0.4076  0.6060  1
0.6736  0.7835  0.8623  0.5998  0.3484  0.4675  0.5877  1
. . .
0.5048  0.6835  0.8481  0.5410  0.2911  0.3306  0.5231  2
0.5104  0.6690  0.8964  0.5073  0.3155  0.2828  0.4830  2
```

Using scikit GPR is simultaneously easy and difficult. The key code that creates my demo classification model is:

```  print("Creating GPC model with default RBF kernel ")
krnl = 1.0 * RBF(1.0)  # RBF parameter will be optimized
model = GaussianProcessClassifier(kernel=krnl, random_state=0)

print("Training model ")
model.fit(train_X, train_y)
print("Done ")
```

The code is deceptively simple because the main effort is spent on guessing which kernel functions to use and all their parameters. Surprisingly, the default RBF (radial basis function) kernel worked very well and gave 80.00% accuracy (24 out of 30 correct) on the test data. This is pretty much the same results I get when using a sophisticated PyTorch neural network classifier. In most GPC scenarios you spend a lot of time trying different kernel functions with different parameters.

Gaussian process classification is very complex mathematically. It is an extension of Gaussian process regression, where the output is a single numerice value, plus a logistic sigmoid link function to squash the output to a probability (for binary classification), plus a one-vs-rest scheme for multi-class classification.

There are dozens of blog posts and YouTube videos that explain GPR and GPC at many different levels. There’s not one best explanation of GPR — it all depends on what sort of background you have. Do you already understand kernels or not? Do you already understand multivariate Gaussian distributions or not? Do you already understand covariance matrices or not? Do you already understand the logistic sigmoid function or not? And so on.

Good fun though.

When Gaussian process regression or classification models fail, they often fail spectacularly. I grew up during the early days of the U.S. space program when there were spectacular successes and failures.

Left: The very first U.S. attempt to launch a satellite on a Vanguard rocket in 1957 was an embarrassing failure but the lessons learned eventually led to the first man on the moon just 12 years later.

Center: A Juno rocket failed a few seconds after launch in 1959 when its guidance system malfunctioned. Just a few months later, on May 5, 1961, Alan Shepard sat on top of a lengthened version of Juno called Redstone, and became the first American in space. Tremendous bravery!

Right: An Atlas rocket failed on the launch pad in September 1961. The design of the Atlas required internal pressure to retain structural integrity, and loss of pressure was disastrous. Just five months later, on February 20, 1962, John Glenn sat on top of an Atlas booster and became the first American in orbit. What incredible courage the early American astronauts and Soviet cosmonauts had! Sadly, NASA has lost its way and the upcoming mission to the moon is a politically correct embarrassment with two clearly unqualified astronauts, and Russia’s invasion of Ukraine is beyond despicable. But I’m optimistic that NASA will eventually get back on track and sanity will eventually return to Russia.

Demo code. The training and test data can be found at https://jamesmccaffrey.wordpress.com/2023/04/04/the-wheat-seeds-dataset-problem-using-pytorch/.

```# wheat_gauss_process_classify.py
# Gaussian process classification

# Anaconda3-2022.10  Python 3.9.13
# scikit 1.0.2  Windows 10/11

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process.kernels import DotProduct
from sklearn.gaussian_process.kernels import WhiteKernel
from sklearn.gaussian_process.kernels import ConstantKernel

# -----------------------------------------------------------

def main():
# 0. prepare
print("\nBegin scikit Gaussian process classification ")
print("Predict wheat seed species (Kama=0, Rosa=1, \
np.random.seed(1)
np.set_printoptions(precision=4, suppress=True)

# -----------------------------------------------------------

train_file = ".\\Data\\wheat_train_k.txt"
usecols=[0,1,2,3,4,5,6],

test_file = ".\\Data\\wheat_test_k.txt"
usecols=[0,1,2,3,4,5,6],
print("Done ")

print("\nFirst few predictor values data: ")
print(train_X[0:4][:])
print(". . .")
print("\nFirst few actual species: ")
print(train_y[0:4])
print(". . .")

# -----------------------------------------------------------

# 2. create and train GPC model
print("\nCreating GPC model with default RBF kernel ")

# GaussianProcessClassifier(kernel=None, *,
#  optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0,
#  max_iter_predict=100, warm_start=False, copy_X_train=True,
#  random_state=None, multi_class='one_vs_rest', n_jobs=None)
#
# RBF(length_scale=1.0, length_scale_bounds=(1e-5, 100000.0))
# DotProduct(sigma_0=1.0, sigma_0_bounds=(1e-5, 100000.0))
# WhiteKernel(noise_level=1.0,
#  noise_level_bounds=(1e-5, 100000.0))

krnl = 1.0 * RBF(1.0)
# krnl = RBF(1.0, length_scale_bounds=(1e-6, 100000.0))
# krnl = DotProduct() + WhiteKernel(noise_level=0.5)
# krnl = ConstantKernel(1.0, (1e-1, 1e3)) *
#  RBF(10.0, (1e-3, 1e3))
model = GaussianProcessClassifier(kernel=krnl, \
random_state=0)

print("\nTraining model ")
model.fit(train_X, train_y)
print("Done ")

# -----------------------------------------------------------

# 3. compute model accuracy
print("\nComputing accuracy ")
acc_train = model.score(train_X, train_y)
print("\nAccuracy on train data = %0.4f " % acc_train)
acc_test = model.score(test_X, test_y)
print("Accuracy on test data = %0.4f " % acc_test)

# 4. use model
X = np.array([[0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]],
dtype=np.float32)
print("\nPredicting wheat species for dummy predictors ")
pred_species = model.predict_proba(X)
print("Prediction: ")
print(pred_species)

# 5. TODO: save model using pickle

print("\nEnd GPC demo ")

if __name__ == "__main__":
main()
```

## Parameterizing PyTorch Neural Network Architecture and Training Values for Evolutionary Optimization

When creating a neural network prediction model, you have to set values for the architecture (number hidden layers, number hidden nodes in each layer, hidden activation, etc.) and training (optimizer, batch size, etc.) In some scenarios you can manually experiment with these hyperparameter values. In other scenarios, you can set up lists of possible values and then use random search or grid search.

A more sophisticated approach is to use evolutionary optimization to find a good set of architecture and training values. This is a project I’ve been looking at recently. As part of my experiments, I put together a demo that parameterizes a network and training values and then computes a fitness value. The idea is best explained by code.

Suppose you want to predict the political leaning of a person (conservative = 0, moderate = 1, liberal = 2) from their sex (male = -1, female = +1), age (divided by 100), State (Michigan = 100, Nebraska = 010, Oklahoma = 001), and income (divided by \$100,000). Now, consider this code:

```  # first, create train_ds and test_ds
print("Setting 6-(10-10)-3 tanh 10 0.01 1000 SGD")
f = fitness(n_hid=10, activ='tanh',
trn_ds=train_ds, tst_ds=test_ds,
bs=10, lr=0.01, me=1000, opt='sgd')
print("Fitness = %0.4f " % f)
```

The fitness function creates a 6-(10-10)-3 neural network classifier with tanh() hidden node activation, and trains it using a batch size of 10, stochastic gradient descent with a learning rate of 0.01, and 1000 epochs. The return value is a measure of how good the network is, often called a fitness value in evolutionary optimization terminology.

The fitness() function is very short and simple because the function farms out most of the work to program-defined train() and accuracy() functions:

```def fitness(n_hid=10, activ='tanh', trn_ds=None, tst_ds=None,
bs=10, lr=0.01, me=1000, opt='sgd'):

T.manual_seed(1)  # prepare
np.random.seed(1)

net = Net(n_hid, activ).to(device)  # create

net.train()
train(net, trn_ds, bs, lr, me, opt)  # train

net.eval()
acc_train = accuracy_quick(net, trn_ds)  # evaluate
acc_test = accuracy_quick(net, tst_ds)
return (acc_train + acc_test) / 2
```

I decided to define fitness as the average of the accuracy of the trained network on the training and test data. This is something I need to give more thought to.

I don’t believe it’s feasible to create a general purpose framework for parameterization — each problem is significantly different. The real decisions are what to parameterize and what to hard-code. For example, my demo hard-codes the architecture with a fixed two hidden layers rather than a variable number of layers.

The parameterization is just the first part of an evolutionary optimization system. My next steps will be to add functions to generate random solutions, select two parent solutions, combine two parents to produce a child solution, and mutate child solutions.

Fascinating stuff (to me anyway).

Evolution has produced some strange animals. Left: Tullimonstrum, informally known as the Tully monster, is an extinct invertebrate that lived about 300 million years ago. It was about 14 inches long and had two primitive eye stalks. Right: Opabinia is an extinct arthropod that lived about 500 million years ago. It was about three inches long and had five eyes. Images like these in my head are one of several reasons why I don’t eat calamari.

Demo code below. The training and test data can be found at https://jamesmccaffrey.wordpress.com/2022/09/01/multi-class-classification-using-pytorch-1-12-1-on-windows-10-11/.

```# people_politics_encoded.py
# predict politics type from sex, age, state, income
# PyTorch 2.0.1-CPU Anaconda3-2022.10  Python 3.9.13
# Windows 10/11

# experiemnt for hyperparameter evolutionary optimization

import numpy as np
import torch as T
device = T.device('cpu')  # apply to Tensor or Module

# -----------------------------------------------------------

class PeopleDataset(T.utils.data.Dataset):
# sex  age    state    income   politics
# -1   0.27   0  1  0   0.7610   2
# +1   0.19   0  0  1   0.6550   0
# sex: -1 = male, +1 = female
# politics: conservative, moderate, liberal

def __init__(self, src_file):
tmp_x = all_xy[:,0:6]   # cols [0,6) = [0,5]
tmp_y = all_xy[:,6]     # 1-D

self.x_data = T.tensor(tmp_x,
dtype=T.float32).to(device)
self.y_data = T.tensor(tmp_y,
dtype=T.int64).to(device)  # 1-D

def __len__(self):
return len(self.x_data)

def __getitem__(self, idx):
preds = self.x_data[idx]
trgts = self.y_data[idx]
return preds, trgts  # as a Tuple

# -----------------------------------------------------------

class Net(T.nn.Module):
def __init__(self, n_hid, activ='tanh'):
super(Net, self).__init__()
self.hid1 = T.nn.Linear(6, n_hid)  # 6-(nh-nh)-3
self.hid2 = T.nn.Linear(n_hid, n_hid)
self.oupt = T.nn.Linear(n_hid, 3)

if activ == 'tanh':
self.activ = T.nn.Tanh()
elif activ == 'relu':
self.activ = T.nn.ReLU()

# use default weight init

def forward(self, x):
z = self.activ(self.hid1(x))
z = self.activ(self.hid2(z))
z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss()
return z

# -----------------------------------------------------------

def accuracy_quick(model, dataset):
# assumes model.eval()
X = dataset[0:len(dataset)][0]
Y = dataset[0:len(dataset)][1]
oupt = model(X)  #  [40,3]  logits
arg_maxs = T.argmax(oupt, dim=1)  # argmax() is new
num_correct = T.sum(Y==arg_maxs)
acc = (num_correct * 1.0 / len(dataset))
return acc.item()

# -----------------------------------------------------------

def train(net, ds, bs, lr, me, opt='sgd'):
# dataset, bat_size, lrn_rate, max_epochs, optimizer
shuffle=True)
loss_func = T.nn.NLLLoss()
if opt == 'sgd':
optimizer = T.optim.SGD(net.parameters(), lr=lr)

print("\nStarting training ")
for epoch in range(0, me):
epoch_loss = 0.0  # for one full epoch
for (batch_idx, batch) in enumerate(train_ldr):
X = batch[0]  # inputs
Y = batch[1]  # correct class/label/politics

oupt = net(X)
loss_val = loss_func(oupt, Y)  # a tensor
epoch_loss += loss_val.item()  # accumulate
loss_val.backward()
optimizer.step()

if epoch % le == 0:
print("epoch = %5d  |  loss = %10.4f" % \
(epoch, epoch_loss))
print("Done ")

# -----------------------------------------------------------

def fitness(n_hid=10, activ='tanh', trn_ds=None, tst_ds=None,
bs=10, lr=0.01, me=1000, opt='sgd'):

T.manual_seed(1)  # prepare
np.random.seed(1)

net = Net(n_hid, activ).to(device)  # create

net.train()
train(net, trn_ds, bs, lr, me, opt)  # train

net.eval()
acc_train = accuracy_quick(net, trn_ds)  # evaluate
acc_test = accuracy_quick(net, tst_ds)
return (acc_train + acc_test) / 2

# -----------------------------------------------------------

def main():
# 0. get started
print("\nBegin People predict politics type ")

print("\nCreating People Datasets ")

train_file = ".\\Data\\people_train.txt"
train_ds = PeopleDataset(train_file)  # 200 rows

test_file = ".\\Data\\people_test.txt"
test_ds = PeopleDataset(test_file)    # 40 rows

# 2. compute fitness for architecture and train parameters
print("\nSetting 6-(10-10)-3 tanh 10 0.01 1000 SGD")
f = fitness(n_hid=10, activ='tanh',
trn_ds=train_ds, tst_ds=test_ds,
bs=10, lr=0.01, me=1000, opt='sgd')
print("\nFitness = %0.4f " % f)

print("\nSetting 6-(8-8)-3 relu 10 0.01 1000 Adam")
f = fitness(n_hid=8, activ='relu',
trn_ds=train_ds, tst_ds=test_ds,
print("\nFitness = %0.4f " % f)

# 3. TODO: verify trained model is valid
# 4. TODO: save trained model

print("\nEnd People predict politics encoding demo")

if __name__ == "__main__":
main()
```

## Sorting a Python List of Objects for Evolutionary Algorithms

When I implement an Evolutionary Algorithm using the Python language, I’m faced with the question of how to store possible solutions (typically vectors of floating point values) and their associated errors (typically a single floating point value). There are many design possibilities, each with pros and cons.

One design choice is to use a Python list or NumPy array of objects. For example, a class that defines a possible solution and its associated error could be:

```class Solution:
def __init__(self, soln, err):
self.soln = soln
self.error = err

def show(self):
print(self.soln, end="")
print(" | ", end="")
print("%0.4f" % self.error)
```

Then, a list of solution-objects could be created like so:

```solns = []  # list of objects
solns.append( Solution(np.array([4,3,0,3],
dtype=np.int64), 10.0 ))
solns.append( Solution(np.array([1,3,1,2],
dtype=np.int64), 7.0 ))
solns.append( Solution(np.array([0,1,0,3],
dtype=np.int64), 4.0 ))
solns.append( Solution(np.array([4,3,2,1],
dtype=np.int64), 10.0 ))
solns.append( Solution(np.array([2,2,2,0],
dtype=np.int64), 6.0 ))
```

The first entry has a possible solution of [4,3,0,3] and associated error of 10.0. In a non-demo scenario, the associated error would be computed by some program-defined error function. For simplicity, I made the error for each potential solution the sum of the values in the solution.

A pro of using object-based storage for Evolutionary Algorithms is that the data is very easy to sort by error:

```solns.sort(key=lambda x:x.error)  # sorts in place
```

Other possible designs for Evolutionary Algorithms include 1.) using parallel arrays where one parallel array holds possible solution information (typically a vector) to some problem, and the other parallel array holds the error value associated with each possible solution, 2.) using a Python Tuple with solution and error fields, 3.) a Python List or NumPy Array where each item is a list or array with a possible solution and the associated error.

Even though there are technical pros and cons for each design choice for evolutionary algorithms, just as important to me are the subjective pros and cons. It’s often difficult to articulate subjective pros and cons but recognizing them is a characteristic of experienced engineers, as opposed to engineers who are relatively inexperienced. Inexperienced engineers typically are happy just to get a program up and running.

I wrote this blog post while sitting in the London Heathrow Airport, waiting to board a plane to fly back home (to the U.S.) Some of my favorite British authors include Charles Dickens, Ian Fleming, Agatha Christie, Sir Arthur Conan Doyle, and J.K. Rowling. Here are three book cover designs for the first Harry Potter book, “Harry Potter and the Sorcerer’s Stone” (“Harry Potter and the Philosopher’s Stone” in the U.K.) Book cover design is very subjective and most people seem to prefer the version of the cover design that they read.

Demo program code:

```# object_sort_demo.py

import numpy as  np

# ------------------------------------

class Solution:
def __init__(self, soln, err):
self.soln = soln
self.error = err

def show(self):
print(self.soln, end="")
print(" | ", end="")
print("%0.4f" % self.error)

# ------------------------------------

solns = []  # list of objects
solns.append( Solution(np.array([4,3,0,3],
dtype=np.int64), 10.0 ))
solns.append( Solution(np.array([1,3,1,2],
dtype=np.int64), 7.0 ))
solns.append( Solution(np.array([0,1,0,3],
dtype=np.int64), 4.0 ))
solns.append( Solution(np.array([4,3,2,1],
dtype=np.int64), 10.0 ))
solns.append( Solution(np.array([2,2,2,0],
dtype=np.int64), 6.0 ))

print("\nsolns: ")
for i in range(len(solns)):
solns[i].show()

solns.sort(key=lambda x:x.error)  # sorts in place

print("\nsorted solns: ")
for i in range(len(solns)):
solns[i].show()

print("\nEnd demo ")
```

## “Researchers Successfully Predict NFL Professional Football Scores” on the Pure AI Web Site

I contributed to an article titled “Researchers Successfully Predict NFL Professional Football Scores” on the April 2023 edition of the Pure AI web site. See https://pureai.com/articles/2023/04/03/zoltar.aspx.

Zoltar is my prediction system for American NFL football. For the 2022-23 NFL season, Zoltar achieved 65% accuracy against the Las Vegas point spread.

Zoltar is the result of an informal collaboration among myself and two of my machine learning colleagues. We work at three different large tech companies. Zoltar is strictly a research project and isn’t used for actual wagering.

The Zoltar program uses a combination of reinforcement learning and deep neural network technology to predict football results. The phrase “predicting football results” has several different meanings. Zoltar predicts against the point spread, which is best explained by an example.

Suppose you are interested in an upcoming game between the Chicago Bears and the Detroit Lions. Several “sports books,” mostly based in Las Vegas, publish an opening point spread such as “Lions -6.0 at Bears.” This means the Lions are favored by six points and are playing at the Bears home field. If a person wagers money on the favored Lions, the wager will win only if the Lions win by more than six points. If the Lions win but by less than six points, or if the underdog Bears win by any score, the wager on the Lions loses.

The Zoltar program computes the expected margin of victory between two teams. Zoltar then suggests a hypothetical wager when the difference between the computed margin of victory and the point spread margin of victory is greater than 4.0 points.

I provided some quotes. “Probability can trace its roots to mathematician Gerolamo Cardano who wrote a book titled ‘Games of Chance’ in the mid-1500s.”

“The Zoltar program was created to explore machine learning algorithms applied to problems where mathematics interacts with human psychology. For example, some people will bet with their hearts rather than with their heads, by wagering on the team located in the city where they live. If enough people do this, it can create mathematical imbalances between the point spread and a mathematically predicted margin of victory.”

I played Electric Football when I was a young man. The game was invented in 1948 by Tudor Metal Products Corporation and it’s still popular today. It was basically impossible to predict how the players would move on the vibrating playing field, but that was part of the fun. Left: The basic set is quite plain. Right: Some people paint their players and use a lot of detail.

## An Example of AdaBoost Classification Using the scikit Library

Basic decision trees have several weaknesses and so there are many enhanced tree models. These include, in order of increasing complexity, bootstrap aggregation (“bagging”), random forest, adaptive boosting (“AdaBoost”), and gradient boosting. There are many variations of each of the four enhanced tree models.

Note: Gradient boosting is an advanced form of AdaBoost, and XGBoost (“extreme boosting”) is an advanced form of gradient boosting. The XGBoost algorithm is not directly implemented in the scikit library.

I put together a demo of the scikit AdaBoost module.

In very high-level pseudo-code, the AdaBoost algorithm looks like:

```create a primitive decision stub tree
loop 50 times
create a new weighted decison stub
end-loop
model = majority vote of the 50 trees
```

The pseudo-code omits many important details. Here’s another version of pseudo-code that has more details. It assumes a binary classification scenario, where the two classes are coded as -1 and +1.

There are several variations of AdaBoost. They’re all fairly complex but the Wikipedia article on AdaBoost is pretty good (unlike many Wikipedia machine learning articles).

I put together a demo. I used one of my standard multi-class classification problems. The data looks like:

``` 1   0.24   1   0   0   0.2950   2
0   0.39   0   0   1   0.5120   1
1   0.63   0   1   0   0.7580   0
0   0.36   1   0   0   0.4450   1
. . .
```

Each line of data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (0 = conservative, 1 = moderate, 2 = liberal). The goal is to predict politics type from sex, age, state, income. There are 200 training items and 40 test items.

The signature of the AdaBoost module constructor is deceptively simple:

```  # AdaBoostClassifier(base_estimator=None, *, n_estimators=50,
#  learning_rate=1.0, algorithm='SAMME.R',
#  random_state=None)
```

The actual parameter complexity comes from the internal DecisionTreeClassifier which is used by default as the estimator:

```  # DecisionTreeClassifier(*, criterion='gini',
#  splitter='best', max_depth=None, min_samples_split=2,
#  min_samples_leaf=1, min_weight_fraction_leaf=0.0,
#  max_features=None, random_state=None,
#  max_leaf_nodes=None, min_impurity_decrease=0.0,
#  class_weight=None, ccp_alpha=0.0)
```

For my demo I created an AdaBoost classifier with all default parameters except for supplying a random_state value so that results are reproducible:

```  print("Creating AdaBoost model using default params ")
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(max_depth=1)
n_estimators=50, learning_rate=1.0, random_state=1)
model.fit(train_x, train_y)
print("Done ")
```

The results weren’t very good. As usual with most tree-based classifiers, prediction accuracy on the training data is very good but the model was overfitted and had poor accuracy on the test data.

Variations of decision tree classifiers are seductive in the sense that they’re very simple and easy to understand. But neural network classifiers have enabled the fantastic breakthroughs in artificial intelligence and machine learning. Even so, tree-based classifiers can still be useful in many real-world scenarios.

Good fun.

Machine learning decision tree models are seductive to newcomers to machine learning but tree models often turn out well. Female alien seduction in science fiction movies usually doesn’t turn out well for the seductee.

Left: In “Lifeforce” (1985) the crew of a space shuttle discovers a huge alien spaceship with the bodies of two men and a woman. Sure, let’s bring them to Earth. Unfortunately, all three are space vampires, including the one female known as Space Girl. This is a pretty good movie.

Center: In “Queen of Blood” (1966), a crew from Earth goes to Mars and discovers a crashed alien spaceship. There’s a female alien inside. Sure, let’s bring her back to Earth. Unfortunately she is a space vampire who can seduce with glowing eyes. Not a bad movie if you’re a fan of old sci-fi B quality movies like I am.

Right: In “Species” (1995) scientists receive information from aliens about how to splice their DNA with human DNA. Sure, let’s try that on Earth. Unfortunately, the result is a super alien woman named Sil who wants to reproduce. The consequences of mating with Sil are not pleasant for her male victims. This is a surprisingly good movie.

Demo code below. The training and test data can be found at https://jamesmccaffrey.wordpress.com/2023/02/13/multi-class-classification-using-a-scikit-decision-tree.

```# people_politics_adaboost.py

# predict politics (0 = con, 1 = mod, 2 = lib)
# from sex, age, state, income.

# sex  age    state    income   politics
#  0   0.27   0  1  0   0.7610   2
#  1   0.19   0  0  1   0.6550   0
# sex: 0 = male, 1 = female
# state: michigan = 100, nebraska = 010, oklahoma = 001
# politics: conservative, moderate, liberal

# Anaconda3-2022.10  Python 3.9.13  scikit 1.0.2
# Windows 10/11

import numpy as np

# ---------------------------------------------------------

def show_confusion(cm):
dim = len(cm)
mx = np.max(cm)             # largest count in cm
wid = len(str(mx)) + 1      # width to print
fmt = "%" + str(wid) + "d"  # like "%3d"
for i in range(dim):
print("actual   ", end="")
print("%3d:" % i, end="")
for j in range(dim):
print(fmt % cm[i][j], end="")
print("")
print("------------")
print("predicted    ", end="")
for j in range(dim):
print(fmt % j, end="")
print("")

# ---------------------------------------------------------

def main():
print("Predict politics from sex, age, State, income ")
np.random.seed(1)
np.set_printoptions(precision=4, suppress=True)

# sex  age    state    income   politics
#  0   0.27   0  1  0   0.7610   2
#  1   0.19   0  0  1   0.6550   0

train_file = ".\\Data\\people_train.txt"
train_x = train_xy[:,0:6]
train_y = train_xy[:,6].astype(int)

test_file = ".\\Data\\people_test.txt"
test_x = test_xy[:,0:6]
test_y = test_xy[:,6].astype(int)

print("\nTraining data:")
print(train_x[0:4])
print(". . . \n")
print(train_y[0:4])
print(". . . ")

# ---------------------------------------------------------

# 2. create and train
#  learning_rate=1.0, algorithm='SAMME.R',
#  random_state=None, base_estimator='deprecated')

# DecisionTreeClassifier(*, criterion='gini',
#  splitter='best', max_depth=None, min_samples_split=2,
#  min_samples_leaf=1, min_weight_fraction_leaf=0.0,
#  max_features=None, random_state=None,
#  max_leaf_nodes=None, min_impurity_decrease=0.0,
#  class_weight=None, ccp_alpha=0.0)

print("\nCreating AdaBoost model using default params ")
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(max_depth=1)
n_estimators=50, learning_rate=1.0, random_state=1)
model.fit(train_x, train_y)
print("Done ")

# 3. evaluate
acc_train = model.score(train_x, train_y)
print("\nAccuracy on train = %0.4f " % acc_train)
acc_test = model.score(test_x, test_y)
print("Accuracy on test = %0.4f " % acc_test)

# 3b. display formatted confusion matrix
from sklearn.metrics import confusion_matrix
y_predicteds = model.predict(test_x)
cm = confusion_matrix(test_y, y_predicteds)
print("\nConfusion matrix: \n")
show_confusion(cm)

# 4. use model
print("\nPredict for: M 35 Nebraska \$55K ")
X = np.array([[0, 0.35, 0,1,0, 0.5500]],
dtype=np.float32)
probs = model.predict_proba(X)
print("\nPrediction pseudo-probs: ")
print(probs)

politic = model.predict(X)
print("\nPredicted class: ")
print(politic)

# 5. TODO: save model using pickle

if __name__ == "__main__":
main()
```

## A Quick Look at the FLAML Library for Automatic Machine Learning

I was sitting at home one weekend, waiting for the rain to stop so I could walk my dogs. I have spoken at the PyData conference before, but I’m not speaking this year. I was scanning through the conference agenda and noticed that one of the talks is about the FLAML library (“Fast and Lightweight Auto ML”) for automatic ML. See https://microsoft.github.io/FLAML/. I figured I’d investigate.

There are many different AutoML libraries. I think of AutoML as an extension of automated hyperparameter search. In hyperparameter search, you set a machine learning model, such as kernel ridge regression, that suits the prediction problem at hand. Then a hyperparameter search program performs an automated examination to find the best values for the model’s parameters.

Installing FLAML worked without problems.

In AutoML, you don’t specify the model. You let the AutoML system try different models and different sets of relevant hyperparameters.

Installing FLAML was easy; just “pip install flaml”. That was nice. Trust me, setting up a library that installs using pip is no small task.

The FLAML documentation was quite good. The documentation has code examples instead of blah, blah, blah about architecture.

I used the FLAML documentation regression example as a template to tackle one of my standard regression problems. The goal is to predict a person’s income (divided by \$100,000) from sex (male = 0, female = 1), age (divided by 100), State (Michigan = 100, Nebraska = 010, Oklahoma = 001), and political leaning (conservative = 100, moderate = 010, liberal = 001).

FLAML is impressive, but I’m just not a fan of AutoML systems for the type of work I do.

My experiment demo worked the first time. Nice.

However, even though I was impressed with FLAML, I am not a fan of AutoML systems in general. In my opinion, AutoML systems operate at too high of a level of abstraction. Machine learning is very tricky and leaving it to black box systems just doesn’t feel right to me, especially for the type of ML work I do.

Robots and androids are the epitome of human automation. Left: In “Metropolis” (1927), a robot named Maria is created by an evil scientist. A pretty good movie if you’re a fan of science fiction movie history. I give the movie a B grade. Center: In “Futureworld” (1976), there are many different kinds of robots including those for companionship. I liked the part where it is revealed that robots are designing new robots that are designing new robots. I give the movie a C grade. Right: In “Surrogates” (2009), almost all people sit in pods and live their lives through robots. I give the movie a B grade.

Demo code below. You can find the data at https://jamesmccaffrey.wordpress.com/2022/10/10/regression-people-income-using-pytorch-1-12-on-windows-10-11/.

```# people_income_flaml.py
#
# Anaconda3-2022.10  Python 3.9.13
# Windows 10/11
# FLAML 1.2.0

# predict income from sex, age, State, politics

import numpy as np
from flaml import AutoML

# 0. prepare
print("\nPredict income using FLAML AutoML ")
np.random.seed(1)

train_file = ".\\Data\\people_train.txt"
dtype=np.float32)
train_X = train_xy[:,[0,1,2,3,4,6,7,8]]
train_y = train_xy[:,5].flatten()  # 1D required

# 2. Initialize an AutoML instance
print("\nCreating a FLAML object ")
automl = AutoML()
automl_settings = {
"time_budget": 1,  # in seconds
"metric": 'r2',
"log_file_name": "people_income.log"
}

# 3. find and train model
print("\nFinding best model ")
automl.fit(X_train=train_X, y_train=train_y,
**automl_settings)
print("Done ")

# 4. show best model found
best_model = automl.model.estimator
print("\nBest model: ")
print(best_model)

# 5. use model
print("\nPredict income for Male, 24, Michigan, liberal ")
X = np.array([[1, 0.24, 1,0,0,  0,0,1]],
dtype=np.float32)
pred_inc = automl.predict(X)
print(pred_inc)

print("\nEnd FLAML demo ")
```

## Don’t Blindly Trust All Machine Learning Datasets

I ran across a machine learning example that used the California Housing dataset. I didn’t know much about that dataset so I did a little exploration. I loaded and examined the dataset using the scikit library:

```from sklearn.datasets import fetch_california_housing

bunch = fetch_california_housing(as_frame=True)
info = bunch.frame.info()
print(info)
desc = bunch.frame.describe()
print(desc)
```

The info() is, in part:

```  :Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive and the target
:Attribute Information:
- MedInc        median income in block group
- HouseAge      median house age in block group
- AveRooms      average number of rooms per household
- AveBedrms     average number of bedrooms per household
- Population    block group population
- AveOccup      average number of household members
- Latitude      block group latitude
- Longitude     block group longitude
- MedHouseVal   median house value in block (div 100,000)
:Missing Attribute Values: None
```

So, essentially, the idea is to predict the median house value (in 1990) in a California “block group” — a census area.

OK, but the more detailed information showed these max and min values:

```       MedInc Age AveRooms  AveBedrms   Pop  AveOccup  MedVal
min    0.49    1      0.84      0.33      3      0.69  0.14
max   15.00   52    141.90     34.06  35682   1243.33  5.00
```

The data is wacky. For example, the maximum average occupancy in one census block group is 1,243.33 people per house. And one block group has an average of 141.90 rooms per house. What?

Prisons have a big average occupancy per “house”.

Well, I spent some time diving into the data. I sorted the data by average occupancy per house and found the associated latitude and longitude of the California block group that has an average of 1243.33 people per house. It turned out to be at lat/lon (38.32, -121.98), which . . . drum roll please . . . contains the Solano State Prison about 150 miles east of San Francisco. Mystery solved.

The conclusion is that the California Housing dataset is not usable as-is for machine learning purposes. The dataset requires many hours of preparation.

The moral is don’t blindly trust machine learning datasets.

I’ve never been in prison but it doesn’t sound like it’d be much fun. A chain gang is a group of prisoners chained together to do things such as road work. Chain gangs were introduced in the late 1860s but were mostly phased out by the mid 1950s.

On any given day, there are about 2 million people in U.S. prisons. Incarceration rates vary greatly. According to the U.S. Bureau of Justice, less than 0.03% of Asian males are ever jailed at some point in their lives but over 30% of Black males are jailed and in some census blocks in Baltimore, over 90% are jailed eventually. But I don’t blindly trust the data — it could be better or even worse.

Left: A chain gang from the early 20th century.

Right: The comedy movie “Pardon Us” (1931) featured Stan Laurel and Oliver Hardy. A very funny movie (to me anyway — funny is in the eye of the beholder).

## An Example of Random Forest Classification Using the scikit Library

Basic decision trees have several weaknesses and so there are many enhanced tree models. These include, in order of increasing complexity, bootstrap aggregation (“bagging”), random forest, adaptive boosting (“AdaBoost”), and gradient boosting. There are many variations of each of the four enhanced tree models.

I put together a demo of the scikit random forest module.

In very high-level pseudo-code, scikit default random forest is:

```loop N times
fetch a random subset of training data
create a basic decision tree from subset
end-loop
model = majority vote of the N trees
```

The random subset of the training data items is selected by picking only some of the predictor variables. For example, if there are 6 predictor variables, then each tree might be based on just 3 of the predictors. The scikit default is the square root of the number of predictors (rounded to the nearest integer).

The idea is that the order of the data will change for each sub-tree and robustness is introduced by looking at different sets of predictors. These two operations reduce model overfitting which is the major weakness of tree classifiers

I put together a demo. I used one of my standard multi-class classification problems. The data looks like:

``` 1   0.24   1   0   0   0.2950   2
0   0.39   0   0   1   0.5120   1
1   0.63   0   1   0   0.7580   0
0   0.36   1   0   0   0.4450   1
. . .
```

Each line of data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (0 = conservative, 1 = moderate, 2 = liberal). The goal is to predict politics type from sex, age, state, income. There are 200 training items and 40 test items.

The signature of the random forest module constructor is complex:

```  # RandomForestClassifier(n_estimators='warn',
#  criterion='gini', max_depth=None, min_samples_split=2,
#  min_samples_leaf=1, min_weight_fraction_leaf=0.0,
#  max_features='auto', max_leaf_nodes=None,
#  min_impurity_decrease=0.0, min_impurity_split=None,
#  bootstrap=True, oob_score=False, n_jobs=None,
#  random_state=None, verbose=0, warm_start=False,
#  class_weight=None)
```

It would take at least a couple of pages to explain all these parameters but the two most important are the n_estimators (number of trees) and max_features (number of randomly selected predictors for each tree). Also important is the random_state parameter to get reproducible results. For my demo I tried:

```  print("Creating RandomForestClassifier model ")
model = RandomForestClassifier(n_estimators=10,
max_features=3, random_state=1)

model.fit(train_x, train_y)
print("Done ")
```

The results weren’t very good: as usual with most tree-based classifiers, prediction accuracy on the training data is excellent, but the model was overfitted and had poor accuracy on the test data.

Among my machine learning colleagues, guys tend to fall into one of two buckets: those who use mostly tree technques and those who use mostly neural techniques. I tend to use neural techniques but I’ll often look at a tree model too to see if the models agree.

Good fun.

One of my favorite movie genres is fantasy. Many fantasy films feature memorable forest scenes. Here are three forest scenes randomly selected from my memory. Left: In “The Fellowship of the Ring” (2001), the Hobbits are pursued by the Dark Riders in the forest. Very scary! Center: In “Labyrinth” (1986), Sarah is searching for her baby brother who was stolen by Jareth, the Goblin King. Not scary. Right: In “The Brothers Grimm” (2005), Wilhelm and Jacob must go through a very evil forest with very evil trees to get to the castle of the very evil queen. Very scary.

Demo code below. The training and test data can be found at https://jamesmccaffrey.wordpress.com/2023/02/13/multi-class-classification-using-a-scikit-decision-tree/

```# people_politics_forest.py

# predict politics (0 = con, 1 = mod, 2 = lib)
# from sex, age, state, income.
# uses random forest

# sex  age    state    income   politics
#  0   0.27   0  1  0   0.7610   2
#  1   0.19   0  0  1   0.6550   0
# sex: 0 = male, 1 = female
# state: michigan = 100, nebraska = 010, oklahoma = 001
# politics: conservative, moderate, liberal

# Anaconda3-2022.10  Python 3.9.13  scikit 1.0.2
# Windows 10/11

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# ---------------------------------------------------------

def show_confusion(cm):
dim = len(cm)
mx = np.max(cm)             # largest count in cm
wid = len(str(mx)) + 1      # width to print
fmt = "%" + str(wid) + "d"  # like "%3d"
for i in range(dim):
print("actual   ", end="")
print("%3d:" % i, end="")
for j in range(dim):
print(fmt % cm[i][j], end="")
print("")
print("------------")
print("predicted    ", end="")
for j in range(dim):
print(fmt % j, end="")
print("")

# ---------------------------------------------------------

def main():
print("\nBegin scikit random forest example ")
print("Predict politics from sex, age, State, income ")
np.random.seed(1)
np.set_printoptions(precision=4, suppress=True)

# sex  age    state    income   politics
#  0   0.27   0  1  0   0.7610   2
#  1   0.19   0  0  1   0.6550   0

train_file = ".\\Data\\people_train.txt"
train_x = train_xy[:,0:6]
train_y = train_xy[:,6].astype(int)

test_file = ".\\Data\\people_test.txt"
test_x = test_xy[:,0:6]
test_y = test_xy[:,6].astype(int)

print("\nTraining data:")
print(train_x[0:4])
print(". . . \n")
print(train_y[0:4])
print(". . . ")

# ---------------------------------------------------------

# 2. create and train
# RandomForestClassifier(n_estimators='warn',
#  criterion='gini', max_depth=None, min_samples_split=2,
#  min_samples_leaf=1, min_weight_fraction_leaf=0.0,
#  max_features='auto', max_leaf_nodes=None,
#  min_impurity_decrease=0.0, min_impurity_split=None,
#  bootstrap=True, oob_score=False, n_jobs=None,
#  random_state=None, verbose=0, warm_start=False,
#  class_weight=None)

print("\nCreating RandomForestClassifier model ")
model = RandomForestClassifier(n_estimators=10,
max_features=3, random_state=1)
model.fit(train_x, train_y)
print("Done ")

# 3. evaluate
acc_train = model.score(train_x, train_y)
print("\nAccuracy on train = %0.4f " % acc_train)
acc_test = model.score(test_x, test_y)
print("Accuracy on test = %0.4f " % acc_test)

# 3b. display formatted confusion matrix
from sklearn.metrics import confusion_matrix
y_predicteds = model.predict(test_x)
cm = confusion_matrix(test_y, y_predicteds)
print("\nConfusion matrix: \n")
show_confusion(cm)

# 4. use model
print("\nPredict for: M 35 Nebraska \$55K ")
X = np.array([[0, 0.35, 0,1,0, 0.5500]],
dtype=np.float32)
probs = model.predict_proba(X)
print("\nPrediction pseudo-probs: ")
print(probs)

politic = model.predict(X)
print("\nPredicted class: ")
print(politic)

# 5. TODO: save model using pickle

print("\nEnd scikit random forest demo ")

if __name__ == "__main__":
main()
```

## “Regression Using a scikit MLPRegressor Neural Network” in Visual Studio Magazine

I wrote an article titled “Regression Using a scikit MLPRegressor Neural Network” in the May 2023 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2023/05/01/regression-scikit.aspx.

A regression problem is one where the goal is to predict a single numeric value. For example, you might want to predict the annual income of a person based on their sex, age, state where they live and political leaning. Note that the common “logistic regression” machine learning technique is actually a binary classification system in spite of its name.

Arguably the most powerful regression technique is a neural network model. There are several tools and code libraries that you can use to create a neural network regression model. I usually use PyTorch but the scikit-learn library (also called scikit or sklearn) is based on the Python language and is popular among beginners. The original versions of scikit were built around classical machine learning techniques such as categorical Gaussian naive Bayes classification, kernel ridge regression, and various forms of decision trees. But neural network classifiers and regression systems were added in about 2016 or 2017 when it was clear that neural techniques were the trend of the future of ML.

I used one of my standard synthetic datasets. The goal is to predict annual income from sex (male = -1, female = +1), age (divided by 100), State of residence (Michigan = 100, Nebraska = 010, Oklahoma = 001), and political leaning (conservative = 100, moderate = 010, liberal = 001). The data looks like:

``` 1   0.24   1 0 0   0.2950   0 0 1
-1   0.39   0 0 1   0.5120   0 1 0
1   0.63   0 1 0   0.7580   1 0 0
-1   0.36   1 0 0   0.4450   0 1 0
1   0.27   0 1 0   0.2860   0 0 1
. . .
```

There are 200 training items and 40 test items. The difficult part of any neural network is setting up the many hyperparameters. The demo uses:

```  import numpy as np
from sklearn.neural_network import MLPRegressor

# 2. create network
params = { 'hidden_layer_sizes' : [10,10],
'activation' : 'relu', 'solver' : 'adam',
'alpha' : 0.0, 'batch_size' : 10,
'random_state' : 0, 'tol' : 0.0001,
'nesterovs_momentum' : False,
'learning_rate' : 'constant',
'learning_rate_init' : 0.01,
'max_iter' : 1000, 'shuffle' : True,
'n_iter_no_change' : 50, 'verbose' : False }
```

After setting up the parameters, creating and training the neural network regression prediction system is trivial:

```  print("Creating 8-(10-10)-1 relu neural network ")
net = MLPRegressor(**params)
net.fit(train_x, train_y)
print("Done ")
```

Using the scikit MLPRegressor module is relatively easy (assuming you understand what the hyperparameters mean), but the technique isn’t very flexible. For flexibility, you should use PyTorch but PyTorch requires much more effort.

For children, one of the strongest predictor variables for their adult income is the family structure in which they’re raised. Sadly, there are enormous differences in family structure depending upon race. In some racial groups, over 78% of children are born and raised by single mothers, and in some Census areas (such as parts of Baltimore), that figure is over 95%. I’ve never come across research that explains why this self-destructive behavior occurs. Maybe AI/ML can help somehow.

## Sorting a Python List of Tuples for Evolutionary Algorithms

When I implement an Evolutionary Algorithm using the Python language, I’m faced with the question of how to store possible solutions (typically vectors) and their associated errors (typically a single float32 value). There are many, many, many design possibilities that have pros and cons — if one technique was best I wouldn’t be writing this blog post.

One design choice is to use a Python List of Array of Tuples where the first tuple item is a possible solution and the second tuple item is the associated error. For example:

```solns = []  # list of tuples (array,float)
solns.append( (np.array([4,3,0,3], dtype=np.int64), 10.0) )
solns.append( (np.array([1,3,1,2], dtype=np.int64), 7.0) )
solns.append( (np.array([0,1,0,3], dtype=np.int64), 4.0) )
solns.append( (np.array([4,3,2,1], dtype=np.int64), 10.0) )
solns.append( (np.array([2,2,2,0], dtype=np.int64), 6.0) )
```

The first entry has a possible solution of [4,3,0,3] and associated error of 10.0. In a non-demo scenario, the associated error would be computed by some program-defined error function. For simplicity, I made the error for each potential solution the sum of the values in the solution. Note that a con of using a Tuple is that items are immutable which can create issues depending on the design of the Evolutionary Algorithm using the tuples.

A pro of using Tuple storage for Evolutionary Algorithms is that the data is very easy to sort by error:

```solns.sort(key=lambda tup:tup[1])  # sorts in place
```

Other possible designs for Evolutionary Algorithms include 1.) using parallel arrays where one parallel array holds possible solution information (typically a vector) to some problem, and the other parallel array holds the error value associated with each possible solution, 2.) using a Python class with solution and error fields, 3.) a Python List or NumPy Array where each item is a list or array with a possible solution and the associated error.

The moral of this blog post is that for experienced machine learning engineers, picking the design of a system is usually more difficult than the corresponding implementation.

I wrote this blog post while I was sitting in a hotel room in London, waiting for friends and family to run in the 2023 London marathon. London is a very interesting city in many different ways. Many British sports car designs of the 1960s were very beautiful, but implementation, especially electrical and hydraulic systems, was not very good.

Left: A 1967 model Austin-Healey 3000. To my eye, one of the most beautiful classic sports car designs. Center: A Triumph TR6 from about 1969. I played in a band in the 1980s and our lead guitarist, Wayne Ogin, had a beautiful British racing green colored TR6 but it was quite unreliable and was often in the repair shop. Right: A 1963 Aston-Martin DB5 — the famous James Bond car — the epitome of cool design.

Demo program code:

```# tuple_sort_demo.py

import numpy as  np

solns = []  # list of tuples (array,float)
solns.append( (np.array([4,3,0,3], dtype=np.int64), 10.0) )
solns.append( (np.array([1,3,1,2], dtype=np.int64), 7.0) )
solns.append( (np.array([0,1,0,3], dtype=np.int64), 4.0) )
solns.append( (np.array([4,3,2,1], dtype=np.int64), 10.0) )
solns.append( (np.array([2,2,2,0], dtype=np.int64), 6.0) )

print("\nsolns: ")
for i in range(len(solns)):
print(solns[i])

solns.sort(key=lambda tup:tup[1])  # sorts in place

# solns.sort(key=lambda tup:tup[1], reverse=True)
# solns = sorted(soln, key=lambda tup:tup[1])

print("\nsorted solns: ")
for i in range(len(solns)):
print(solns[i])

print("\nEnd demo ")
```