## Ordinal Classification Using Keras

An ordinal classification problem (confusingly, also called ordinal regression) is one where the goal is to predict a class label in situations where the labels have an ordering. For example, you might want to predict the price of a house, based on things like area in sq. feet, where the house price in the training data is 0 = low, 1 = medium, 2 = high, 3 = very high. You could just use regular neural classification techniques, but that doesn’t take advantage of the ordering information in the data. Put differently, if a true class label is 2 = high, the error for a prediction of 0 = low should be greater than a prediction of 1 = medium.

For ordinal classification, I use a technique that I haven’t seen described anywhere else. But the idea is obvious so maybe the technique is used under some fancy name. If the training data has ordinal class labels like 0, 1, 2, 3 then I convert them to float targets of 0.125, 0.375, 0.625, 0.875. I create a neural network that emits a single numeric value between 0.0 and 1.0 and use mean squared error to compare a computed output with the associated float target. If you think this through, you’ll see how the ordering information is used.

I recently upgraded my Keras code library to version 2.6 and so I figured I’d code up a demo of ordinal classification using that version. I generated a 200-item set of synthetic training data that looks like:

```
-1   0.1275   0   1   0   2   0   0   1
1   0.1100   1   0   0   3   1   0   0
-1   0.1375   0   0   1   0   0   1   0
1   0.1975   0   1   0   2   0   0   1
. . .
```

Each item is a house. The first column is air conditioning, the second column is area in square feet (divided by 10,000), the next three columns are one-hot encoded style (1,0,0 = art_deco, 0,1,0 = bungalow, 0,0,1 = colonial), the next column is price (0 = low, 1 = medium, 2 = high, 3 = very high), and the last three columns are local school (1,0,0 = johnson, 0,1,0 = kennedy, 0,0,1 = lincoln).

The key to the ordinal classification technique I use is mapping ordinal labels to float targets. For k = 4 classes, the idea can be explained graphically:

```
0-------------1------------2------------3------------4
0.00         0.25         0.50         0.75         1.00
0.125        0.375        0.625        0.875
```

There are 4 bins, one for each class label. The float targets are the midpoints of the bins when the total length of the bins is normalized to 1.0. A function to compute targets for ordinal classification is:

```
import numpy as np

def make_float_targets(k):
  # midpoints of k equal-width bins on [0.0, 1.0]
  targets = np.zeros(k, dtype=np.float32)
  start = 1.0 / (2 * k)  # like 0.125
  delta = 1.0 / k        # like 0.250
  for i in range(k):
    targets[i] = start + (i * delta)
  return targets
```
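To see how the ordering information enters the error, here is a small sketch that maps a few labels to float targets and compares squared errors. The label values and the variable names `err_low` and `err_med` are made up for illustration. With a true label of 2 = high, an output sitting at the "low" target is penalized more than an output sitting at the "medium" target.

```python
import numpy as np

def make_float_targets(k):
  # midpoints of k equal-width bins on [0.0, 1.0]
  targets = np.zeros(k, dtype=np.float32)
  start = 1.0 / (2 * k)
  delta = 1.0 / k
  for i in range(k):
    targets[i] = start + (i * delta)
  return targets

targets = make_float_targets(4)
labels = np.array([0, 1, 2, 3, 2])   # made-up ordinal labels
float_y = targets[labels]            # label i -> midpoint of bin i
print(float_y)  # [0.125 0.375 0.625 0.875 0.625]

true_t = targets[2]                      # true class is 2 = high
err_low = (targets[0] - true_t) ** 2     # output at the "low" target
err_med = (targets[1] - true_t) ** 2     # output at the "medium" target
print(err_low, err_med)  # 0.25 0.0625
```

A standard one-hot classifier would treat both wrong predictions the same way, which is the information the float targets recover.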

I coded up a demo using Keras 2.6 without too much trouble, other than the usual glitches that happen with any neural system. I noticed that when I computed classification accuracy, using an item-by-item approach was brutally slow. I suspect this is because there is a lot of conversion between Numpy arrays and Keras/TensorFlow tensors. Anyway, I wrote an accuracy function that used a set approach.
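The set approach can be sketched with plain NumPy. This is not the demo's accuracy function. The name `accuracy_set` is hypothetical, and the sketch assumes the model outputs for all items have already been computed in a single call, so only array operations remain. An output is mapped to the label of the bin it falls in.

```python
import numpy as np

def accuracy_set(outputs, labels, k):
  # outputs: model outputs in [0, 1] for all items at once
  # labels: true ordinal labels 0..k-1
  # an output of exactly 1.0 is clipped into the last bin
  pred = np.clip(np.floor(outputs * k), 0, k - 1).astype(np.int64)
  return np.mean(pred == labels)

outputs = np.array([0.12, 0.40, 0.58, 0.91])  # made-up model outputs
labels = np.array([0, 1, 3, 3])               # made-up true labels
acc = accuracy_set(outputs, labels, 4)
print(acc)  # 0.75
```

The third item lands in bin 2 while its true label is 3, so three of the four predictions count as correct.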

Good fun. Neural network technologies have advanced quickly, but are still relatively crude. When more powerful computing engines become available (probably via quantum computing), neural networks will do things that are impossible to imagine today. Advances in aircraft engines enabled amazing performance improvements in just a few years. Left: The British S.E.5a (1917) had a top speed of 130 mph. Center: Just 20 years later, the British Spitfire Mk I (1937) had a top speed of 360 mph. Right: Just 20 years later, the U.S. Vought F-8 Crusader (1957) had a top speed of 1,200 mph.

Code and data below. Long.

## Natural Language Question-Answering Using Hugging Face

I’m currently on a multi-week mission to explore the Hugging Face (HF) code library for Transformer Architecture (TA) systems for natural language processing (NLP) and today I did a question-answer (QA) example. Whew! That’s a lot of acronyms in an introductory sentence (IS)!

TA systems are extraordinarily complex, so implementing a TA system from scratch or using a low-level library like PyTorch or Keras is only barely feasible. The HF library makes writing TA systems much simpler, with the downside that customizing a TA system built on HF can be very difficult. My approach to learning a new technology is to 1.) get a documentation example working, 2.) refactor the example, 3.) repeat until the overall picture gels in my head.

My recent example is extractive question-answer. I set up a raw text source corpus of a few sentences from the Wikipedia article on Transformers. Then I created a BERT-based model using HF and used the model to answer the question, “How do transformers work?” The computed answer was “deep learning model that adopts the mechanism of attention, differentially weighing the significance of each part of the input data.”

The moral of the story is that there are no shortcuts when it comes to learning a complex new software library or framework. You have to take one step at a time. Today was one of those steps for me. Here are three beach photos where there is no answer to the question, “Why?”

Demo code:

```
# qa_test.py

# Python 3.7.6 (Anaconda3-2020.02)
# PyTorch 1.9.0 CPU, HugFace 4.2.2, Windows 10

# extractive question-answering using Hugging Face

from transformers import AutoTokenizer, \
  AutoModelForQuestionAnswering
import torch as T

def main():
  print("\nBegin extractive question-answer using Hugging Face ")

  corpus = r"""
A transformer is a deep learning model that adopts the
mechanism of attention, differentially weighing the
significance of each part of the input data. It is used
primarily in the field of natural language processing
(NLP) and in computer vision (CV).

Like recurrent neural networks (RNNs), transformers are
designed to handle sequential input data, such as natural
language, for tasks such as translation and text
summarization. However, unlike RNNs, transformers do not
necessarily process the data in order. Rather, the
attention mechanism provides context for any position in
the input sequence.
"""

  # NOTE: the checkpoint name was lost from this listing.
  # The name below is an assumption: a BERT model fine-tuned
  # on SQuAD, as used in the HF documentation example.
  model_name = \
    "bert-large-uncased-whole-word-masking-finetuned-squad"
  toker = \
    AutoTokenizer.from_pretrained(model_name)
  model = \
    AutoModelForQuestionAnswering.from_pretrained(model_name)

  quest = "How do transformers work?"
  print("\nThe question: ")
  print(quest)

  # tokenize question and corpus together as one input pair
  inpts = toker(quest, corpus, add_special_tokens=True,
    return_tensors="pt")
  inpt_ids = inpts["input_ids"].tolist()[0]  # flat list of ids
  oupts = model(**inpts)

  ans_start_scores = oupts.start_logits
  ans_end_scores = oupts.end_logits

  # most likely first token, and one past most likely last token
  ans_start = T.argmax(ans_start_scores)
  ans_end = T.argmax(ans_end_scores) + 1

  ans = \
    toker.convert_tokens_to_string \
    (toker.convert_ids_to_tokens(inpt_ids[ans_start:ans_end]))
  print(ans)

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
```
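The last few statements of the demo select the answer span: the model emits one start score and one end score per token, and the answer is the run of tokens between the two highest-scoring positions. The toy sketch below shows just that step with NumPy. The token list and the score values are made up and no model is involved.

```python
import numpy as np

# made-up tokens standing in for a tokenized question + corpus
tokens = ["[CLS]", "how", "do", "transformers", "work", "?", "[SEP]",
  "a", "transformer", "is", "a", "deep", "learning", "model", "[SEP]"]

# made-up scores, one per token
start_logits = np.full(len(tokens), -5.0)
start_logits[11] = 4.0   # "deep" is the most likely first token
end_logits = np.full(len(tokens), -5.0)
end_logits[13] = 3.5     # "model" is the most likely last token

ans_start = int(np.argmax(start_logits))
ans_end = int(np.argmax(end_logits)) + 1  # +1 because slices exclude the end
answer = " ".join(tokens[ans_start:ans_end])
print(answer)  # deep learning model
```

If the best end position comes before the best start position, the slice is empty, which is one way an extractive system can end up with no answer.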

## NFL 2021 Week 3 Predictions – Zoltar Is Obsessed With Underdogs

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #3 of the 2021 season. These predictions are tentative, in the sense that it usually takes Zoltar about three weeks to hit his stride.

```
Zoltar:    panthers  by    0  dog =      texans    Vegas:    panthers  by  7.5
Zoltar:       bills  by    6  dog =    redskins    Vegas:       bills  by    9
Zoltar:      browns  by    6  dog =       bears    Vegas:      browns  by  7.5
Zoltar:      ravens  by    0  dog =       lions    Vegas:      ravens  by    6
Zoltar:   cardinals  by    3  dog =     jaguars    Vegas:   cardinals  by  7.5
Zoltar:      chiefs  by    6  dog =    chargers    Vegas:      chiefs  by  6.5
Zoltar:      saints  by    0  dog =    patriots    Vegas:    patriots  by    3
Zoltar:      giants  by    5  dog =     falcons    Vegas:      giants  by    3
Zoltar:      titans  by    5  dog =       colts    Vegas:      titans  by  5.5
Zoltar:    steelers  by    6  dog =     bengals    Vegas:    steelers  by  4.5
Zoltar:     broncos  by    6  dog =        jets    Vegas:     broncos  by 10.5
Zoltar:     raiders  by    4  dog =    dolphins    Vegas:     raiders  by   10
Zoltar:    seahawks  by    0  dog =     vikings    Vegas:    seahawks  by    1
Zoltar:        rams  by    2  dog =  buccaneers    Vegas:  buccaneers  by    1
Zoltar:     packers  by    0  dog = fortyniners    Vegas: fortyniners  by    1
Zoltar:     cowboys  by    5  dog =      eagles    Vegas:     cowboys  by    4
```

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I use 3.0 points difference but for the first few weeks of the season I go a bit more conservative and use 4.0 points difference as the advice threshold criterion. At the beginning of the season, because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is very strongly biased towards Vegas underdogs. I need to fix this.

1. Zoltar likes Vegas underdog Texans against the Panthers.
2. Zoltar likes Vegas underdog Lions against the Ravens.
3. Zoltar likes Vegas underdog Jaguars against the Cardinals.
4. Zoltar likes Vegas underdog Jets against the Broncos.
5. Zoltar likes Vegas underdog Dolphins against the Raiders.
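The advice rule can be sketched in a few lines. This is my reading of the rule, not Zoltar's actual code: the function name `advice` and the tuple layout are hypothetical, and the strict greater-than comparison is an assumption. Only a handful of the week #3 games are included.

```python
games = [
  # (Zoltar winner, Zoltar margin, Zoltar dog, Vegas favorite, Vegas margin)
  ("panthers", 0, "texans", "panthers", 7.5),
  ("bills", 6, "redskins", "bills", 9.0),
  ("saints", 0, "patriots", "patriots", 3.0),
  ("broncos", 6, "jets", "broncos", 10.5),
  ("cowboys", 5, "eagles", "cowboys", 4.0),
]

def advice(z_win, z_margin, z_dog, v_fav, v_margin, threshold=4.0):
  # hypothetical helper: returns the team to bet on, or None
  v_dog = z_dog if v_fav == z_win else z_win
  # Zoltar's margin seen from the Vegas favorite's point of view
  z_fav_margin = z_margin if v_fav == z_win else -z_margin
  diff = v_margin - z_fav_margin
  if diff > threshold:
    return v_dog   # Zoltar thinks the favorite wins by much less
  if diff < -threshold:
    return v_fav   # Zoltar thinks the favorite wins by much more
  return None

picks = [advice(*g) for g in games]
print(picks)  # ['texans', None, None, 'jets', None]
```

When Zoltar and Vegas disagree on the winner, as in the Saints-Patriots game, the two margins add: Zoltar has the Saints by 0 and Vegas has the Patriots by 3, so the difference is 3 points and no bet is suggested.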

Update: There were many relatively late point spread changes. I’ll deal with them later.

For example, a bet on the underdog Texans against the Panthers will pay off if the Texans win by any score, or if the favored Panthers win but by less than 7.5 points (in other words, win by 7 points or fewer).

Theoretically, if you must bet \$110 to win \$100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
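The 53% figure comes from simple arithmetic: you risk \$110 to win \$100, so the break-even accuracy is 110 / (110 + 100). A quick check, where `expected_profit` is a hypothetical helper name:

```python
risk = 110.0  # amount lost on a wrong bet
win = 100.0   # amount won on a correct bet

breakeven = risk / (risk + win)
print("%0.4f" % breakeven)  # 0.5238

def expected_profit(p):
  # average profit per bet at prediction accuracy p
  return p * win - (1.0 - p) * risk

print("%0.2f" % expected_profit(0.53))  # 1.30
print("%0.2f" % expected_profit(0.60))  # 16.00
```

At 53% accuracy the average profit is only about \$1.30 per \$110 bet, so the margin for error is thin.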

In week #2, against the Vegas point spread, Zoltar went 4-6 (using 3.0 points as the advice threshold).

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn’t useful except for parlay betting. In week #2, just predicting the winning team, Zoltar went 8-8, which isn’t very good but is typical of the first few weeks of the season. In week #2, just predicting the winning team, Vegas (“the wisdom of the crowd”) went 11-5.

Zoltar sometimes predicts a 0-point margin of victory. There are five such games in week #3. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner. My prediction system is named after the Zoltar fortune teller machine you can find in arcades. Debugging software is kind of like being a detective. Unfortunately there are no crystal balls available to help.

## Researchers Explore Bayesian Neural Networks on Pure AI

I contributed to an article titled “Researchers Explore Bayesian Neural Networks” on the Pure AI web site. See https://pureai.com/articles/2021/09/07/bayesian-neural-networks.aspx.

The agenda of the recently completed 2021 International Conference on Machine Learning (ICML) listed over 30 presentations related to the topic of Bayesian neural networks. The article explains what Bayesian neural networks are and why there is such great interest in them.

The term “Bayesian” loosely means “based on probability”. A Bayesian neural network (BNN) has weights and biases that are probability distributions instead of single fixed values. Each time a Bayesian neural network computes output, the values of the weights and biases will change slightly, and so the computed output will be slightly different every time. To make a prediction using a BNN, one approach is to feed the input to the BNN several times and average the results.

At first thought, Bayesian neural networks don’t seem to make much sense. However, BNNs have two advantages over standard neural networks. First, the built-in variability in BNNs makes them resistant to model overfitting. Model overfitting occurs when a neural network is trained too well. Even though the trained model predicts with high accuracy on the training data, when presented with new previously unseen data, the overfitted model predicts poorly. A second advantage of Bayesian neural networks over standard neural networks is that you can identify inputs where the model is uncertain of its prediction. For example, if you feed an input to a Bayesian neural network five times and you get five very different prediction results, you can treat the prediction as an “I’m not sure” result.

The screenshot shows an example of a Bayesian neural network in action on the well-known Iris Dataset. The goal is to predict the species (0 = setosa, 1 = versicolor, 2 = virginica) of an iris flower based on sepal length and width, and petal length and width. A sepal is a leaf-like structure. After the Bayesian neural network was trained, it was fed an input of [5.0, 2.0, 3.0, 2.0] three times. The first output was [0.0073, 0.8768, 0.1159]. These are probabilities of each class. Because the largest probability value is 0.8768 at index [1], the prediction is class 1 = versicolor.
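To make the sample-and-average idea concrete, here is a minimal NumPy sketch. It is not the article's demo. The single-layer network, the `w_mean`, `w_std`, `b_mean` and `b_std` arrays, and all of their values are hypothetical stand-ins for a trained BNN. Each call to `forward` draws a fresh set of weights and biases from their distributions, so the same input gives slightly different outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical "trained" parameters: every weight and bias is a
# normal distribution described by a mean and a standard deviation
w_mean = rng.normal(0.0, 0.1, size=(4, 3))
w_std = np.full((4, 3), 0.10)
b_mean = np.zeros(3)
b_std = np.full(3, 0.10)

def softmax(z):
  e = np.exp(z - np.max(z))
  return e / np.sum(e)

def forward(x):
  # sample concrete weights and biases for this one call
  w = rng.normal(w_mean, w_std)
  b = rng.normal(b_mean, b_std)
  return softmax(x @ w + b)

x = np.array([5.0, 2.0, 3.0, 2.0])
outputs = np.array([forward(x) for _ in range(5)])

mean_probs = outputs.mean(axis=0)  # averaged prediction
spread = outputs.std(axis=0)       # large spread = "I'm not sure"
print(mean_probs, np.argmax(mean_probs))
print(spread)
```

How large the spread must be before a prediction is treated as uncertain is a judgment call that depends on the problem.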

Even though I didn’t say so in the article, I’m mildly skeptical about Bayesian neural networks. The idea has a feel of a solution in search of a problem — something that’s very common in research. But this isn’t completely bad. Research needs to work in two ways: 1.) start with a problem and then find a way to solve it, and 2.) start with an idea and then find a problem that can be solved with it. Gambling is Bayesian. I always enjoy gambling scenes in science fiction. Left: A scene from the “Star Trek: The Next Generation” TV show (1987-1994). Center: A scene in the casino town of Canto Bight from “Star Wars: The Last Jedi” (2017). Right: Actor Justin Timberlake plays poker for his life in “In Time” (2011).


## A Quick Demo of the DBSCAN Clustering Algorithm

I was reading a research paper this morning and the paper used the DBSCAN (“density-based spatial clustering of applications with noise”) clustering algorithm. DBSCAN is somewhat similar to k-means clustering. Both work only with strictly numeric data.

In k-means you must specify the number of clusters. DBSCAN doesn’t require you to specify the number of clusters, but for DBSCAN you must specify an epsilon value (how close is “close”) and a minimum number of points that constitute a core cluster. These two DBSCAN parameters implicitly determine the number of clusters. I hadn’t used DBSCAN in a long time so I coded up a quick demo to refresh my memory. Implementing DBSCAN from scratch isn’t too difficult (I’ve done so using the C# language), but the scikit-learn Python language code library has a built-in implementation that’s simple and easy to use. So my demo was based on the scikit documentation example.

I set up 20 items of dummy data. I used 2D data so that I could graph the results. As with most clustering algorithms, the source data must be normalized so that large magnitude items don’t swamp small magnitude items.
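One common way to normalize is z-score normalization, where each column is shifted to mean 0 and scaled to standard deviation 1. The sketch below uses made-up raw values. The post does not say which normalization was applied to the demo data, so z-score here is an assumption.

```python
import numpy as np

# made-up raw data: column 0 is much larger in magnitude than column 1
raw = np.array([
  [5000.0, 1.2],
  [7200.0, 0.8],
  [6100.0, 1.9],
  [4800.0, 1.1]])

means = raw.mean(axis=0)
stds = raw.std(axis=0)
X = (raw - means) / stds  # each column now has mean 0, std 1
print(X)
```

Without this step, the Euclidean distances that DBSCAN compares against epsilon would be determined almost entirely by the first column.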

The clustering function assigns a cluster ID label to each data item: 0, 1, 2, etc. Items that don’t get assigned to a cluster get a label of -1 to indicate they are “noise”.

I used the documentation code to create a graph of the clustering results. The five red dots are cluster 0, the six green dots are cluster 1, and the four blue dots are cluster 2. Noise items are colored black.

Data items that belong to clusters can be “core points” or non-core points. The large dots are core points, the smaller dots are non-core points.
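The core point idea can be checked directly: a point is a core point if at least min_samples points, counting the point itself (the scikit-learn convention), lie within epsilon of it. The sketch below is only that one test, not a full DBSCAN implementation. The function name `core_point_mask` and the six data points are made up.

```python
import numpy as np

def core_point_mask(X, eps, min_samples):
  # pairwise Euclidean distances, shape (n, n)
  diffs = X[:, None, :] - X[None, :, :]
  dists = np.sqrt(np.sum(diffs ** 2, axis=2))
  # neighbors within eps, counting the point itself
  counts = np.sum(dists <= eps, axis=1)
  return counts >= min_samples

X = np.array([
  [0.00, 0.00], [0.10, 0.00], [0.00, 0.10], [0.10, 0.10],  # tight group
  [0.55, 0.05],   # near the group, but too few neighbors
  [3.00, 3.00]])  # isolated

mask = core_point_mask(X, eps=0.5, min_samples=4)
print(mask)  # [ True  True  True  True False False]
```

The fifth point is within epsilon of two core points, so a full DBSCAN run would attach it to their cluster as a non-core point. The isolated sixth point would be labeled noise.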

Good fun! A cluster of three clever photographs with a playing card theme by Serge Lutens (1942-). Lutens is maybe best known for his work in the 1980s for Shiseido, a Japanese cosmetics company.

Code:

```
# dbscan_cluster.py

import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

print("\nBegin clustering demo with DBSCAN ")

X = np.array([
  [ 0.325, -0.595],
  [ 0.507,  1.619],
  [ 0.817,  1.895],
  [ 1.147,  0.764],
  [-1.285, -0.95 ],
  [-1.237, -0.532],
  [ 1.108,  1.248],
  [-0.847, -0.722],
  [ 0.124, -1.346],
  [ 0.910, -0.227],
  [ 0.310, -0.756],
  [-1.384, -0.715],
  [ 0.736,  1.15 ],
  [ 0.511, -0.517],
  [-1.081, -0.91 ],
  [ 0.416,  1.252],
  [-2.349, -0.42 ],
  [-0.559, -1.161],
  [ 0.806,  1.054],
  [ 1.023, -0.133]])

print("\nX data: ")
print(X)

print("\nPerforming clustering ")
clustering = DBSCAN(eps=0.5, min_samples=4).fit(X)
print("Done ")

print("\nComputed cluster labels: ")
print(clustering.labels_)

print("\nIndices of core points: ")
print(clustering.core_sample_indices_)

# count items per cluster; the last cell holds the noise count
n_clusters = np.max(clustering.labels_) + 1
counts = np.zeros(n_clusters+1, dtype=np.int64)  # noise
for i in range(len(X)):
  lbl = clustering.labels_[i]
  if lbl == -1:
    counts[n_clusters] += 1
  else:
    counts[lbl] += 1
print("\nCluster counts, noise count: ")
print(counts)

print("\nDisplaying clustering: " )

# boolean mask of core points (statements that were lost from
# this listing are reconstructed from the scikit-learn
# documentation example the demo is based on)
core_samples_mask = np.zeros_like(clustering.labels_,
  dtype=bool)
core_samples_mask[clustering.core_sample_indices_] = True

unique_labels = set(clustering.labels_)
colors = [plt.cm.hsv(each) \
  for each in np.linspace(0, 1, \
  len(unique_labels))]
for k, col in zip(unique_labels, colors):
  if k == -1:
    col = [0, 0, 0, 1]  # noise = black
  class_member_mask = (clustering.labels_ == k)

  # core points: large dots
  xy = X[class_member_mask & core_samples_mask]
  plt.plot(xy[:, 0], xy[:, 1], 'o',
    markerfacecolor=tuple(col),
    markeredgecolor='k', markersize=14)

  # non-core points: small dots
  xy = X[class_member_mask & ~core_samples_mask]
  plt.plot(xy[:, 0], xy[:, 1], 'o',
    markerfacecolor=tuple(col),
    markeredgecolor='k', markersize=6)
plt.show()

print("\nEnd demo ")
```

## Differential Evolution Optimization in Visual Studio Magazine

I wrote an article titled “Differential Evolution Optimization” in the September 2021 edition of the Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/09/07/differential-evolution-optimization.aspx.

The most common type of optimization for neural network training is some form of stochastic gradient descent (SGD). SGD has many variations such as Adam (adaptive moment estimation) and Adagrad (adaptive gradient). All SGD-based optimization algorithms use the calculus derivative (gradient) of an error function. But there are alternative optimization techniques that don’t use gradients. Examples include bio-inspired optimization techniques such as genetic algorithms and particle swarm optimization and geometry-inspired techniques such as Nelder-Mead and spiral dynamics. My article explains how to implement a bio-inspired optimization technique called differential evolution optimization (DEO).

An evolutionary algorithm is any algorithm that loosely mimics biological evolutionary mechanisms such as mating, chromosome crossover, mutation and natural selection. Standard evolutionary algorithms can be implemented using dozens of specific techniques. Differential evolution is a special type of evolutionary algorithm that has a relatively well-defined structure:

```
create a population of possible solutions
loop
  for-each possible solution
    pick three other random solutions
    combine the three to create a mutation
    combine curr solution with mutation = candidate
    if candidate is better than curr solution then
      replace current solution with candidate
    end-if
  end-for
end-loop
return best solution found
```

The “differential” term in “differential evolution” is somewhat misleading. Differential evolution does not use Calculus derivatives. The “differential” refers to a specific part of the algorithm where three possible solutions are combined to create a mutation, based on the difference between two of the possible solutions.
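The pseudocode can be turned into a short runnable sketch. This is not the article's implementation. It follows one common variant of differential evolution, and the function names, the sphere test function, and the values of `F` (mutation scale), `CR` (crossover rate), population size and generation count are all assumptions chosen for illustration.

```python
import numpy as np

def sphere(x):
  # simple test function with minimum value 0 at the origin
  return float(np.sum(x ** 2))

def diff_evo(func, dim, pop_size=20, max_gen=200,
  F=0.5, CR=0.7, seed=0):
  rng = np.random.default_rng(seed)
  pop = rng.uniform(-5.0, 5.0, size=(pop_size, dim))
  errs = np.array([func(p) for p in pop])
  for gen in range(max_gen):
    for i in range(pop_size):
      # pick three other random solutions, all different from i
      others = [j for j in range(pop_size) if j != i]
      a, b, c = rng.choice(others, size=3, replace=False)
      # mutation: one solution plus the scaled difference of two others
      mutation = pop[a] + F * (pop[b] - pop[c])
      # candidate: mix current solution with the mutation
      mask = rng.random(dim) < CR
      mask[rng.integers(dim)] = True  # use at least one mutated value
      candidate = np.where(mask, mutation, pop[i])
      cand_err = func(candidate)
      if cand_err < errs[i]:
        pop[i] = candidate
        errs[i] = cand_err
  best = np.argmin(errs)
  return pop[best], errs[best]

best_x, best_err = diff_evo(sphere, dim=3)
print(best_x, best_err)
```

The `pop[b] - pop[c]` term is the difference that gives the technique its name. No gradient of `func` is ever computed, so the error function only needs to be something that can be evaluated.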

Differential evolution optimization was originally designed for use in electrical engineering problems. But DEO has received increased interest as a possible technique for training deep neural networks. The biggest disadvantage of DEO is performance. DEO typically takes much longer to train a deep neural network than standard stochastic gradient descent (SGD) optimization techniques. However, DEO is not subject to the SGD vanishing gradient problem. At some point in the future, it’s quite possible that advances in computing power (through quantum computing) will make differential evolution optimization a viable alternative to SGD training techniques.

There are quite a few interesting science fiction movies that involve alien DNA altering evolution. Here are three, all from 1995. Left: In “Species”, scientists use instructions sent by aliens to splice alien DNA with human DNA. The result was not so good. Center: In “Mosquito”, an alien spacecraft crashes in a forest. A regular mosquito ingests some alien DNA and . . the result was not so good for campers in the area. Right: In “Village of the Damned”, 10 women are mysteriously impregnated by alien DNA. The resulting 10 children don’t turn out to be very friendly.

## Yet Another MNIST Example Using Keras

It’s a major challenge to keep up with the continuous changes to the Keras/TensorFlow neural code library (and the PyTorch library too). I recently upgraded my Keras installation to version 2.6 and so I’m going through all my standard examples to bring them up to date with the inevitable changes.

I was using a new desktop machine and so I had to install TensorFlow 2.6 (which contains Keras 2.6). I ran into unexpected problems when the “wrapt” sub-library refused to build correctly (the installation process builds a .whl file for wrapt instead of using a pre-built .whl file). I found a hack online that suggested issuing the command SET WRAPT_INSTALL_EXTENSIONS=false, before the command pip install tensorflow, and that magically worked. Somehow.

One of the standard examples is the MNIST image dataset. There are 70,000 simple images (60,000 training images and 10,000 test images). Each image has 28×28 pixels and is a handwritten digit from ‘0’ to ‘9’. Each pixel value is between 0 and 255. I used the built-in MNIST dataset from Keras but I could have loaded the raw MNIST data using np.loadtxt() or a similar function.
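Before training, the pixel values are scaled to the range 0.0 to 1.0 and the digit labels are one-hot encoded. The NumPy sketch below shows both steps on five dummy images. The random images and labels are made up, and the `np.eye()` indexing trick is just one way to get the same encoding that the Keras `to_categorical()` function produces.

```python
import numpy as np

rng = np.random.default_rng(1)
# five dummy 28x28 images with pixel values 0..255 (stand-ins for MNIST)
imgs = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
lbls = np.array([3, 0, 9, 5, 3])  # made-up digit labels

# add a channel dimension and scale pixels to [0.0, 1.0]
x = imgs.reshape(5, 28, 28, 1).astype(np.float32) / 255

# one-hot encode: label 3 -> [0 0 0 1 0 0 0 0 0 0]
y = np.eye(10, dtype=np.float32)[lbls]

print(x.shape, x.min(), x.max())
print(y[0])
```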

I used the Model() approach by defining separate layers and then passing the first and last layer to the Model() constructor. An alternative design is to use the Sequential() approach. I have no strong preference between Model() and Sequential() — it’s just syntax.

To save time, I only used 2 training epochs with a batch size of 100. In a non-demo scenario I’d use more epochs, but then have to watch for over-fitting.

After the model was trained, I set up a fake 28×28 image with one vertical bar, one horizontal bar, and one diagonal bar. The trained model predicted the fake image is a ‘5’ with pseudo-probability = 1.0000. Many machine learning systems aren’t very good at distinguishing between authentic and fake items. This has generated interest in ML systems that output a prediction plus a confidence score of some type.

Good fun! Left: A clever pseudo Chanel bag made from a paper grocery bag and a chain from a hardware store. Center: A whimsical pseudo Louis Vuitton bag, complete with misspelling. Right: A serious attempt at a Coach bag, but I don’t think it will fool many people.

Code below.

```
# mnist_tfk.py
# MNIST using CNN
# Keras 2.6.0 in TensorFlow 2.6.0 ("_tfk")
# Anaconda3-2020.02  Python 3.7.6  Windows 10

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'  # suppress warn

import numpy as np
import tensorflow as tf
from tensorflow import keras as K
import matplotlib.pyplot as plt

def main():
  # 0. get started
  print("\nBegin MNIST using Keras %s " % K.__version__)
  np.random.seed(1)
  tf.random.set_seed(1)

  # 1. load data
  # (the end of this statement was lost from the listing;
  # load_data() is the built-in Keras MNIST loader)
  (train_x, train_y), \
  (test_x, test_y) = K.datasets.mnist.load_data()
  train_x = train_x.reshape(60_000, 28, 28, 1)
  test_x = test_x.reshape(10_000, 28, 28, 1)
  train_x = train_x.astype(np.float32)
  test_x = test_x.astype(np.float32)
  train_x /= 255
  test_x /= 255
  train_y = K.utils.to_categorical(train_y, 10)
  test_y = K.utils.to_categorical(test_y, 10)

  # 2. define model
  print("\nCreating network with two Convolution, "
    "two Dropout, two Dense layers ")
  g_init = K.initializers.glorot_uniform(seed=1)

  # NOTE: the ends of the two Conv2D statements were lost from
  # the listing; activation='relu' is an assumption
  x = K.layers.Input(shape=(28,28,1))
  con1 = K.layers.Conv2D(filters=32,
    kernel_size=(3,3), kernel_initializer=g_init,
    activation='relu')(x)
  con2 = K.layers.Conv2D(filters=64,
    kernel_size=(3,3), kernel_initializer=g_init,
    activation='relu')(con1)
  mp1 = K.layers.MaxPooling2D(pool_size=(2,2))(con2)
  do1 = K.layers.Dropout(0.25)(mp1)
  z = K.layers.Flatten()(do1)
  fc1 = K.layers.Dense(units=128,
    kernel_initializer=g_init, activation='relu')(z)
  do2 = K.layers.Dropout(0.5)(fc1)
  fc2 = K.layers.Dense(units=10,
    kernel_initializer=g_init, activation='softmax')(do2)

  model = K.models.Model(x, fc2)

  # NOTE: the optimizer definition was lost from the listing;
  # Adam with default settings is an assumption
  opt = K.optimizers.Adam()
  model.compile(loss='categorical_crossentropy',
    optimizer=opt, metrics=['accuracy'])

  # 3. train model
  bat_size = 100
  max_epochs = 2
  print("\nStarting training with batch size = %d " % bat_size)
  model.fit(train_x, train_y, batch_size=bat_size,
    epochs=max_epochs, verbose=1)
  print("Training finished ")

  # 4. evaluate model
  # evaluate() returns [loss, accuracy]
  eval_res = model.evaluate(test_x, test_y, verbose=0)
  loss = eval_res[0]
  acc = eval_res[1] * 100
  print("\nTest data: loss = %0.4f "
    "accuracy = %0.2f%%" % (loss, acc))

  # 5. save model
  print("\nSaving MNIST model to disk ")
  # mp = ".\\Models\\mnist_model.h5"
  # model.save(mp)

  # 6. use model
  print("\nMaking prediction for fake image: ")
  # np.set_printoptions(precision=4, suppress=True)
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})

  # NOTE: the fixed column and row indices were lost from the
  # listing; column 9 and row 14 are assumptions
  x = np.zeros(shape=(28,28), dtype=np.float32)
  for row in range(5,23):
    x[row][9] = 180  # vertical line
  for rc in range(9,19):
    x[rc][rc] = 250  # diagonal
  for col in range(5,15):
    x[14][col] = 200  # horizontal

  plt.imshow(x, cmap=plt.get_cmap('gray_r'))
  plt.show()

  x = x.reshape(1, 28, 28, 1)
  pred_probs = model.predict(x)
  print("\nPrediction probabilities: ")
  print(pred_probs)

  pred_digit = np.argmax(pred_probs)
  print("\nPredicted digit: ")
  print(pred_digit)

  print("\nEnd MNIST demo ")

if __name__ == "__main__":
  main()
```

## NFL 2021 Week 2 Predictions – Zoltar Likes Eight Vegas Underdogs

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #2 of the 2021 season. These predictions are tentative, in the sense that it usually takes Zoltar about three weeks to hit his stride.

```
Zoltar:    redskins  by    4  dog =      giants    Vegas:    redskins  by    3
Zoltar:      saints  by    0  dog =    panthers    Vegas:      saints  by  3.5
Zoltar:       bears  by    4  dog =     bengals    Vegas:       bears  by    3
Zoltar:      browns  by    6  dog =      texans    Vegas:      browns  by 12.5
Zoltar:       colts  by    2  dog =        rams    Vegas:        rams  by    4
Zoltar:     broncos  by    0  dog =     jaguars    Vegas:     broncos  by    6
Zoltar:    dolphins  by    2  dog =       bills    Vegas:       bills  by  3.5
Zoltar:    patriots  by    0  dog =        jets    Vegas:    patriots  by    6
Zoltar:      eagles  by    2  dog = fortyniners    Vegas: fortyniners  by  3.5
Zoltar:    steelers  by    6  dog =     raiders    Vegas:    steelers  by    6
Zoltar:   cardinals  by    6  dog =     vikings    Vegas:   cardinals  by  4.5
Zoltar:  buccaneers  by    6  dog =     falcons    Vegas:  buccaneers  by 12.5
Zoltar:    chargers  by    6  dog =     cowboys    Vegas:    chargers  by    3
Zoltar:    seahawks  by    6  dog =      titans    Vegas:    seahawks  by  5.5
Zoltar:      chiefs  by    0  dog =      ravens    Vegas:      chiefs  by    4
Zoltar:     packers  by    6  dog =       lions    Vegas:     packers  by 10.5
```

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I use 3.0 points difference but for the first few weeks of the season I go a bit more conservative and use 4.0 points difference as the advice threshold criterion. At the beginning of the season, because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is very strongly biased towards Vegas underdogs. I need to fix this.

1. Zoltar likes Vegas underdog Texans against the Browns.
2. Zoltar likes Vegas underdog Colts against the Rams.
3. Zoltar likes Vegas underdog Jaguars against the Broncos.
4. Zoltar likes Vegas underdog Dolphins against the Bills.
5. Zoltar likes Vegas underdog Jets against the Patriots.
6. Zoltar likes Vegas underdog Eagles against the 49ers.
7. Zoltar likes Vegas underdog Falcons against the Buccaneers.
8. Zoltar likes Vegas underdog Lions against the Packers.

For example, a bet on the underdog Texans against the Browns will pay off if the Texans win by any score, or if the favored Browns win but by less than 12.5 points (in other words, win by 12 points or fewer).

Theoretically, if you must bet \$110 to win \$100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

In week #1, against the Vegas point spread, Zoltar went 4-2 (using 4.0 points as the advice threshold). Zoltar was correct in recommending Vegas underdogs Lions (thanks to a late spread change plus the 49ers giving up 16 points in the final few minutes), Texans, Cardinals, Raiders. Zoltar was wrong in recommending underdogs Colts and Giants.

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn’t useful except for parlay betting. In week #1, just predicting the winning team, Zoltar went 8-8, which isn’t very good but is typical of the first few weeks of the season. In week #1, just predicting the winning team, Vegas (“the wisdom of the crowd”) went 9-7.

Zoltar sometimes predicts a 0-point margin of victory. There are four such games in week #2. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.

Zoltar uses machine learning rather than a crystal ball. Left: “The Wizard of Oz” (1939). Center: “Labyrinth” (1986). Right: “Harry Potter and the Prisoner of Azkaban” (2004).


## Determining If Two Sentences Are Paraphrases Of Each Other Using Hugging Face

Deep neural systems based on Transformer Architecture (TA) have revolutionized the field of natural language processing (NLP). Unfortunately, TA systems are insanely complex, meaning that implementing a TA system from scratch is not feasible, and implementing TA using a low-level library like PyTorch or Keras or TensorFlow is only barely feasible. The Hugging Face library (I hate that name but like the library) is a high-level code library that makes writing TA systems simple, with the downside that customizing a TA system built on Hugging Face can be very difficult.

I recently started work on a speculative project that will use a TA system. In our first team meeting, we decided that our initial approach will be to start with a Hugging Face model and then attempt to customize it, rather than try to build the system using PyTorch or Keras.

Even though I’ve been a software developer for many years, I forgot how to tackle the project. I incorrectly started by looking at all the Hugging Face technical documentation. I quickly got overwhelmed. After taking a short break, I remembered how I learn technology topics — from specific to general. In other words, I learn best by looking at many small, concrete examples. Over time, I learn the big picture. This is in sharp contrast to how some people learn — from general to specific. Those people start by learning the big picture and then learn how to construct concrete examples.

So, my plan is to look at one or two concrete examples of Hugging Face code every day or so. I know from previous experience that it’s important to have buffer time between explorations. My brain can only accept so much technical information until the effect of psychological interference starts — new information bounces off and interferes with old information.

My first example was a paraphrase analysis. Briefly, two sentences are paraphrases if they essentially mean the same thing. I set up two sentences:

```
phrase_0 = "Machine Learning (ML) makes predictions from data"
phrase_1 = "ML uses data to compute a prediction."
```

Although the concept of paraphrases is somewhat subjective, most people would say the two sentences are in fact paraphrases of each other. The demo program is remarkably short because the Hugging Face library is so high-level. The demo emitted two associated pseudo-probabilities: the probability that the sentences are not paraphrases, and the probability that the sentences are paraphrases. The pseudo-probability values were [0.058, 0.942] so the model strongly believed the two sentences are in fact paraphrases.
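The pseudo-probabilities come from applying softmax to the two raw output values (logits) of the model. Here is a minimal sketch of that step in plain Python. The logit values are hypothetical, chosen only for illustration, and are not the values the demo produced; the label order (not-a-paraphrase first) follows the demo output.

```python
import math

def softmax(logits):
  # subtract the max before exponentiating for numerical stability
  m = max(logits)
  exps = [math.exp(z - m) for z in logits]
  total = sum(exps)
  return [e / total for e in exps]

logits = [-1.2, 1.6]  # hypothetical logits, not from the demo
probs = softmax(logits)

labels = ["not a paraphrase", "paraphrase"]
best = max(range(len(probs)), key=lambda i: probs[i])
print(labels[best], round(probs[best], 3))
```

The outputs sum to 1.0, which is why they can be read loosely as probabilities.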

Next step: another concrete Hugging Face example. And then another, and another until the big picture gels in my head.

According to the Google Image Search Similarity tool, these three portraits are artistic paraphrases of each other. Left: Russian woman. Center: Spanish woman. Right: Italian woman.

```python
# paraphrase_test.py

from transformers import AutoTokenizer, \
  AutoModelForSequenceClassification
import torch

print("\nBegin HugFace paraphrase example ")

# pretrained BERT model, fine-tuned on the MRPC paraphrase data
toker = AutoTokenizer.from_pretrained(
  "bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained(
  "bert-base-cased-finetuned-mrpc")

phrase_0 = "Machine Learning (ML) makes predictions from data"
phrase_1 = "ML uses data to compute a prediction."
print("\nFirst phrase: ")
print(phrase_0)
print("\nSecond phrase: ")
print(phrase_1)

# encode the two sentences together as one input pair
phrases = toker(phrase_0, phrase_1, return_tensors="pt")
# print(type(phrases))
# 'transformers.tokenization_utils_base.BatchEncoding'
# derived from a Dictionary

# no_grad() is needed so the result can be converted to NumPy
with torch.no_grad():
  result_logits = model(**phrases).logits
result_probs = torch.softmax(result_logits, dim=1).numpy()

print("\nPseudo-probabilities of not-a-para, is-a-para: ")
print(result_probs)

print("\nEnd HugFace example ")
```

## A Simplified Approach for Ordinal Classification

In a standard classification problem, the goal is to predict a class label. For example, in the Iris Dataset problem, the goal is to predict a species of flower: 0 = “setosa”, 1 = “versicolor”, 2 = “virginica”. Here the class labels are just labels without any meaning attached to the order. In an ordinal classification problem (also called ordinal regression), the class labels have order. For example, you might want to predict the median price of a house in one of 506 towns, where price can be 0 = very low, 1 = low, 2 = medium, 3 = high, 4 = very high. For an ordinal classification problem, you could just use standard classification, but that approach doesn’t take advantage of the ordering information in the training data. I coded up a demo of a simple technique using the PyTorch code library. The same technique can be used with Keras/TensorFlow too.

I used a modified version of the Boston Housing dataset. There are 506 data items. Each item is a town near Boston. There are 13 predictor variables — crime rate in town, tax rate in town, proportion of Black residents in town, and so on. The original Boston dataset contains the median price of a house in each town, divided by \$1,000 — like 35.00 for \$35,000 (the data is from the 1970s when house prices were low). To convert the data to an ordinal classification problem, I mapped the house prices like so:

```
       price             class  count
[$0      to $10,000)       0      24
[$10,000 to $20,000)       1     191
[$20,000 to $30,000)       2     207
[$30,000 to $40,000)       3      53
[$40,000 to $50,000]       4      31
                                 ---
                                 506
```
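The price-to-class mapping in the table can be sketched in a few lines. This is a minimal sketch, assuming prices are stored in thousands of dollars as in the original dataset (35.00 means \$35,000); the function name `price_to_class` is made up for illustration and is not from the demo program.

```python
def price_to_class(price_thousands):
  # each class spans $10,000; the top interval is closed at
  # $50,000, so a price of exactly 50.0 is clamped into class 4
  return min(int(price_thousands // 10), 4)

for p in (9.99, 10.00, 35.00, 50.00):
  print(p, price_to_class(p))
```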

I normalized the numeric predictor values by dividing by a constant so that each normalized value is between -1.0 and +1.0. I encoded the single Boolean predictor value (does town border the Charles River) as -1 (no), +1 (yes).
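A minimal sketch of the two encoding steps just described. The helper names and the divisor of 1000 for the tax rate are hypothetical, picked only to show the idea; the constants actually used for each column are not listed here.

```python
def normalize(value, divisor):
  # divide by a per-column constant chosen so that every
  # normalized value lands between -1.0 and +1.0
  return value / divisor

def encode_bool(flag):
  # -1 = no, +1 = yes
  return 1.0 if flag else -1.0

tax = normalize(296.0, 1000.0)  # hypothetical raw value and divisor
river = encode_bool(False)      # town does not border the river
print(tax, river)
```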

The technique I used for ordinal classification is something I invented myself, at least as far as I know. I’ve never seen it described anywhere else, but it’s not too complicated, so it could well exist under an obscure name of some sort.

For the modified Boston Housing dataset there are k = 5 classes. The class target values in the training data are (0, 1, 2, 3, 4). My neural network system outputs a single numeric value between 0.0 and 1.0 — for example 0.2345. The class target values of (0, 1, 2, 3, 4) generate associated floating point sub-targets of (0.1, 0.3, 0.5, 0.7, 0.9). When I read the data into memory as a PyTorch Dataset object, I map each ordinal class label to the associated floating point target. Then I use standard MSELoss() to train the network.
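The float targets follow a simple pattern: class c out of k classes maps to (2c + 1) / (2k), which is the midpoint of the c-th of k equal-width intervals on [0.0, 1.0]. A minimal sketch of that formula, where the function name `class_to_target` is hypothetical and not taken from the demo program:

```python
def class_to_target(c, k):
  # midpoint of the c-th of k equal-width bins on [0.0, 1.0]
  return (2 * c + 1) / (2 * k)

print([class_to_target(c, 5) for c in range(5)])
print([class_to_target(c, 4) for c in range(4)])
```

With k = 5 this gives the (0.1, 0.3, 0.5, 0.7, 0.9) targets used here, and with k = 4 it gives 0.125, 0.375, 0.625, 0.875.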

Suppose a data item has class label = 3 (high price). The target value for that item is stored as 0.7. The computed predicted price will be something like 0.66 (close to target, so low MSE error and a correct prediction) or maybe 0.23 (far from target, so high MSE error and a wrong prediction). With this scheme, the ordering information is used.
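To make the example concrete, here is a minimal sketch that computes the squared error for the two outputs above and converts each output back to a class label. The decoding rule is an assumption on my part: because the targets are midpoints of k equal-width bins, picking the nearest target is the same as finding which bin the output falls into. Both function names are hypothetical.

```python
def squared_error(output, target):
  return (output - target) ** 2

def output_to_class(output, k):
  # bin boundaries at 1/k, 2/k, ... sit halfway between adjacent
  # targets; an output of exactly 1.0 is clamped to class k-1
  return min(int(output * k), k - 1)

target = 0.7  # class label 3 when k = 5
for output in (0.66, 0.23):
  err = squared_error(output, target)
  print(output, round(err, 4), output_to_class(output, 5))
```

The output 0.66 decodes to class 3 (correct), while 0.23 decodes to class 1 (wrong, and with a much larger squared error).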

For implementation, most of the work is done inside the Dataset object:

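```python
import numpy as np
import torch as T

device = T.device("cpu")  # assumed; set globally in the demo

class BostonDataset(T.utils.data.Dataset):
  # features are in cols [0,12], median price as int in
  # col [13] (assumed)

  def __init__(self, src_file, k):
    # k (number of classes) is for class_to_target_program();
    # the hard-coded mapping below does not use it

    # assumption: src_file is tab-delimited text, 14 columns
    all_xy = np.loadtxt(src_file, usecols=range(0, 14),
      delimiter="\t", comments="#", dtype=np.float32, ndmin=2)
    tmp_x = all_xy[:, 0:13]                 # predictors
    tmp_y = all_xy[:, 13].astype(np.int64)  # labels 0..4

    n = len(tmp_y)
    float_targets = np.zeros(n, dtype=np.float32)  # 1D

    for i in range(n):  # hard-coded is easy to understand
      if tmp_y[i] == 0: float_targets[i] = 0.1
      elif tmp_y[i] == 1: float_targets[i] = 0.3
      elif tmp_y[i] == 2: float_targets[i] = 0.5
      elif tmp_y[i] == 3: float_targets[i] = 0.7
      elif tmp_y[i] == 4: float_targets[i] = 0.9
      else: raise ValueError("Fatal logic error ")

    float_targets = np.reshape(float_targets, (-1,1))  # 2D

    self.x_data = \
      T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = \
      T.tensor(float_targets, dtype=T.float32).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]  # all cols
    price = self.y_data[idx]  # all cols
    return (preds, price)     # tuple of two tensors
```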
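Once the labels are float targets, training looks like ordinary regression with MSELoss(). The following is a minimal sketch, not the demo program: the 13-10-1 architecture, tanh hidden activation, SGD optimizer, learning rate, and the random stand-in data are all assumptions made for illustration. The only parts taken from the description above are the single output between 0.0 and 1.0 and the use of MSELoss().

```python
import torch as T

T.manual_seed(1)

class Net(T.nn.Module):
  def __init__(self):
    super().__init__()
    self.hid = T.nn.Linear(13, 10)  # hidden size is an assumption
    self.oupt = T.nn.Linear(10, 1)

  def forward(self, x):
    z = T.tanh(self.hid(x))
    return T.sigmoid(self.oupt(z))  # single value in (0.0, 1.0)

# random stand-in data: 8 items, 13 predictors in [-1, +1]
x = T.rand(8, 13) * 2.0 - 1.0
labels = T.randint(0, 5, (8, 1))   # ordinal class labels 0..4
y = (2.0 * labels + 1.0) / 10.0    # float targets 0.1 .. 0.9

net = Net()
loss_func = T.nn.MSELoss()
opt = T.optim.SGD(net.parameters(), lr=0.05)

for epoch in range(100):
  opt.zero_grad()
  loss = loss_func(net(x), y)
  loss.backward()
  opt.step()

with T.no_grad():
  preds = net(x)
  # decode each output to a class by its bin (assumed rule)
  pred_classes = T.clamp((preds * 5).long(), max=4)
print(pred_classes.flatten().tolist())
```

The sigmoid on the output node keeps every prediction strictly inside (0.0, 1.0), which matches the range of the float targets.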

There are a few minor, but very tricky details. They’d take much too long to explain in a blog post, so I’ll just say that if you’re interested, examine the code very carefully.

I don’t think it’s possible to assign a strictly numeric value to art. Here are two clever illustrations by artist Casimir Lee. I like the bright colors and combination of 1920s art deco style with 1960s psychedelic style.