NFL 2021 Week 8 Predictions – Zoltar Likes Five Underdogs

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #8 of the 2021 season. It usually takes Zoltar about four weeks to hit his stride and takes humans about eight weeks to get up to speed, so weeks six through nine are usually Zoltar’s sweet spot.

Zoltar:   cardinals  by    2  dog =     packers    Vegas:   cardinals  by  3.5
Zoltar:     falcons  by    4  dog =    panthers    Vegas:     falcons  by  2.5
Zoltar:       bills  by    9  dog =    dolphins    Vegas:       bills  by 11.5
Zoltar:       bears  by    6  dog = fortyniners    Vegas: fortyniners  by  3.5
Zoltar:      browns  by    2  dog =    steelers    Vegas:      browns  by    3
Zoltar:      titans  by    0  dog =       colts    Vegas:       colts  by  2.5
Zoltar:      eagles  by    0  dog =       lions    Vegas:      eagles  by    4
Zoltar:        rams  by    4  dog =      texans    Vegas:        rams  by   14
Zoltar:     bengals  by    2  dog =        jets    Vegas:     bengals  by  4.5
Zoltar:    chargers  by    6  dog =    patriots    Vegas:    chargers  by  5.5
Zoltar:    seahawks  by    7  dog =     jaguars    Vegas:    seahawks  by  3.5
Zoltar:     broncos  by    2  dog =    redskins    Vegas:     broncos  by  3.5
Zoltar:  buccaneers  by    0  dog =      saints    Vegas:  buccaneers  by    4
Zoltar:     cowboys  by    0  dog =     vikings    Vegas:     cowboys  by  2.5
Zoltar:      chiefs  by    6  dog =      giants    Vegas:      chiefs  by 10.5

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I usually use 3.0 points difference but for the first few weeks of the season I go a bit more conservative and use 4.0 points difference as the advice threshold criterion. In middle weeks I sometimes go ultra-aggressive and use a 1.0-point threshold.

Note: Because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is much too strongly biased towards Vegas underdogs. I need to fix this.

For week #8:

1. Zoltar likes the Vegas underdog Bears against the 49ers.
2. Zoltar likes the Vegas underdog Lions against the Eagles.
3. Zoltar likes the Vegas underdog Texans against the Rams.
4. Zoltar likes the Vegas favorite Seahawks over the Jaguars.
5. Zoltar likes the Vegas underdog Saints against the Buccaneers.
6. Zoltar likes the Vegas underdog Giants against the Chiefs.

For example, a bet on the underdog Bears against the 49ers will pay off if the Bears win by any score, or if the favored 49ers win but by less than the point spread of 3.5 points (in other words, win by 3 points or less).

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
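Where does the 53% figure come from? If you win a $110-to-win-$100 bet with probability p, your expected profit per bet is (100 * p) - (110 * (1 - p)). Setting this to zero and solving gives p = 110 / 210 = 0.5238, so you break even at roughly 52.4% accuracy.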

In week #7, against the Vegas point spread, Zoltar went 5-2 (using the aggressive 1.0 points as the advice threshold). Overall, for the season, Zoltar is 30-23 against the spread (56%).

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn’t useful except for parlay betting. In week #7, just predicting the winning team, Zoltar went 10-3 which is slightly better than average.

In week #7, just predicting the winning team, Vegas — “the wisdom of the crowd” — went 8-5, which is a bit worse than average.

Zoltar sometimes predicts a 0-point margin of victory, which means the two teams are evenly matched. There are four such games in week #8. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.



My system is named after the Zoltar fortune teller machine you can find in arcades. Zoltar is a popular Halloween costume too.


Posted in Zoltar

Xenobots: Tiny Bio-Robots Designed Using Machine Learning

I ran into a truly fascinating research paper recently that described “xenobots”. Briefly, a xenobot is a tiny (about 4 one-hundredths of an inch in diameter — about the size of a grain of sand) programmable bio-robot made from frog skin and heart cells. This image illustrates the key ideas:

The green objects are frog skin cells that provide the xenobot structure. The red objects are frog heart muscle cells that contract and expand, and provide the xenobot motion. The first step in creating a xenobot is to design a model using evolutionary algorithm machine learning. The design target is for the xenobot to perform a specific task, such as walking, pushing pellets, carrying payloads, and working together in a swarm to aggregate debris.

Once the abstract model has been created, the physical xenobot is manufactured using frog cells. Xenobots can survive for weeks without food and can heal themselves after damage.

Astonishing!



Nanobots are mechanical robots that are very small — about the size of 10 atoms. I don’t think practical nanobots exist yet.


Posted in Machine Learning

Reading IMDB Movie Review Dataset Files

I was working on the well-known IMDB movie review sentiment analysis problem. The goal is to create a machine learning model that accepts the text of a movie review and predicts if the review is positive (class 1) or negative (class 0).

For experimentation I created a tiny dataset with just 8 reviews: 2 training positive, 2 training negative, 2 test positive, 2 test negative. I used the same structure as the full 25,000-review dataset. The root directory has two subdirectories named “pos” and “neg”. Each subdirectory has individual text files, one file per review.

A major challenge when working with ML is reading data files into memory. I experimented with two different approaches, classic and modern. The classic technique uses the os library along with the os.listdir() function. The classic technique is clear but is brittle because I hard-code directory paths using Windows “\\” separators.

The modern technique uses the Path library along with the iterdir() method. The modern technique is short and efficient but the code is a bit obscure.

The bottom line is that the modern technique is preferable in most cases.



Left: Classic space suits from “Destination Moon” (1950). Center: Neo-modern space suit from “2001: A Space Odyssey” (1968). Right: Modern space suit from “Armageddon” (1998).


Demo code:

# read_imdb_files.py

import os                 # classic
from pathlib import Path  # modern

def read_imdb_classic(root_dir):
  # walk the pos and neg subdirectories using os.listdir();
  # each text file holds one review
  reviews = []; labels = []
  for label_dir in ["pos", "neg"]:
    dir_path = root_dir + "\\" + label_dir
    for fname in os.listdir(dir_path):
      full_name = dir_path + "\\" + fname
      with open(full_name, 'r', encoding='utf-8') as f:
        txt = f.read()
        reviews.append(txt)
        if label_dir == "pos":
          labels.append(1)
        else:
          labels.append(0)
  return (reviews, labels)

def read_imdb_modern(root_dir):
  # walk the pos and neg subdirectories using pathlib iterdir()
  reviews = []; labels = []
  root_dir = Path(root_dir)
  for label_dir in ["pos", "neg"]:
    for f_handle in (root_dir/label_dir).iterdir():
      reviews.append(f_handle.read_text(encoding='utf-8'))
      if label_dir == "pos":
        labels.append(1)
      else:
        labels.append(0)
  return (reviews, labels)

print("\nBegin reading IMDB files demo ")
  
root_dir = ".\\DataTiny\\aclImdb\\train"

print("\nReading IMDB files classic technique: \n")
(reviews, labels) = read_imdb_classic(root_dir)
print(reviews); print(labels)

print("\nReading IMDB files modern technique: \n")
(reviews, labels) = read_imdb_modern(root_dir)
print(reviews); print(labels)

print("\nEnd demo ")
Posted in Machine Learning

The Best Algorithm I’ve Discovered for Positive and Unlabeled Learning (PUL)

A positive and unlabeled learning (PUL) problem occurs when a machine learning set of training data has only a few positive (class 1) labeled items and many unlabeled (could be either negative class 0, or positive class 1) items. For example, you might have a dataset of security information where there are only a few dozen network attacks (class 1) but many thousands of items where it’s not known if each is class 0 or class 1.

The goal of PUL is to use the information contained in the dataset to guess the true labels of the unlabeled data items. After the class labels of some of the unlabeled items have been guessed, the resulting labeled dataset can be used to train a binary classification model using any standard machine learning technique, such as k-nearest neighbors classification or neural binary classification.

PUL is very difficult and there are many published techniques, but none of the techniques I’ve seen were completely convincing in my opinion.

Several months ago I came up with an algorithm that gave me the best results I’d seen in my explorations. My algorithm was based on one I found in a rather obscure research paper. My adaptation worked well, but in the back of my mind I wasn’t 100% satisfied. For one thing, the algorithm randomly selects some of the unlabeled items and temporarily assigns them as negative class 0. Why negative class 0 rather than positive class 1? Also, the algorithm did not extend from binary classification problems to multi-class classification problems. That characteristic just didn’t feel right.


A screenshot of the older algorithm that I extended.

The ideas percolated in my head for many weeks, especially when I was walking my dogs. On one such walk I came up with an extension of my old algorithm and knew for sure it would work before I even coded up a demo experiment.

The technique is moderately complicated. A neural binary classifier accepts inputs and produces a prediction in the form of a p-value between 0.0 and 1.0, where values less than 0.5 indicate a prediction of negative class 0 and values greater than 0.5 indicate positive class 1. The key to my idea is to repeatedly train neural models with three types of data: all available positive class 1 items, an equal number of random noise items labeled as negative class 0, and an equal number of randomly selected unlabeled items temporarily labeled as negative class 0. Then additional models are trained in which the randomly selected unlabeled items are temporarily labeled as positive class 1.

The existing positive class 1 training items provide information about what data patterns are positive. The random noise items are highly likely to be negative class 0, so they give information about negative data item patterns. The randomly selected unlabeled items give information about data item patterns in general; they could be either negative class 0 or positive class 1, so they are treated in both ways.

Suppose you have a dataset with 200 total training items. Of these, 20 are positive class 1. The remaining 180 data items are unlabeled and could be either negative class 0 or positive class 1.

phase 0:    
loop several times
  create a 60-item train dataset with all 20 positive,
  20 random noise labeled as negative class 0,
  and 20 randomly selected unlabeled items that
  are temporarily treated as negative class 0

  train a binary classifier using the 60-item train data

  use trained model to score the 160 unused unlabeled
    data items

  accumulate the p-score for each unused unlabeled item
    
  generate a new train dataset, train, score
end-loop

phase 1:
(same as phase 0 except randomly selected unlabeled items
 are now temporarily treated as positive class 1)

loop several times
  create a 60-item train dataset with all 20 positive,
  20 random noise labeled as negative class 0,
  and 20 randomly selected unlabeled items that
  are temporarily treated as positive class 1

  train a binary classifier using the 60-item train data

  use trained model to score the 160 unused unlabeled
    data items

  accumulate the p-score for each unused unlabeled item
    
  generate a new train dataset, train, score
end-loop
  
for-each of the 180 unlabeled items
  compute the average p-value from phases 0 and 1

  if avg p-value > hi threshold
    guess its label as positive
  else
    insufficient evidence to make a guess
  end-if
end-for

I coded up a demo. The code was surprisingly long and tricky. But this algorithm worked better than any other PUL technique I’ve tried. However, there are many hyperparameters so it’s not possible to state with certainty this algorithm is best in any sense. But conceptually it feels absolutely correct. Notice that the algorithm naturally extends to multi-class classification by adding phase 2, phase 3, and so on.
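The full demo code is too long to show here, but below is a minimal sketch of the two-phase scheme on synthetic data. The scikit-learn LogisticRegression is just a stand-in for the neural binary classifier, and the specific values (10 iterations per phase, a 0.90 hi threshold, the dummy data) are illustrative assumptions, not the values from my demo.

# pul_two_phase_sketch.py
# minimal sketch of the two-phase PUL scheme described above;
# LogisticRegression is a stand-in for the neural binary classifier

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
dim = 4
pos_x = rng.normal(1.0, 1.0, size=(20, dim))   # 20 known positive items
unl_x = rng.normal(0.0, 1.0, size=(180, dim))  # 180 unlabeled items

sum_p = np.zeros(180)  # accumulated p-scores per unlabeled item
cnt_p = np.zeros(180)  # times each unlabeled item was scored

for temp_label in (0, 1):  # phase 0: treat as class 0; phase 1: class 1
  for _ in range(10):      # "loop several times"
    sel = rng.choice(180, size=20, replace=False)   # 20 random unlabeled
    noise_x = rng.normal(0.0, 3.0, size=(20, dim))  # 20 random noise items
    train_x = np.vstack([pos_x, noise_x, unl_x[sel]])
    train_y = np.array([1]*20 + [0]*20 + [temp_label]*20)

    model = LogisticRegression().fit(train_x, train_y)

    unused = np.setdiff1d(np.arange(180), sel)  # the 160 unused items
    sum_p[unused] += model.predict_proba(unl_x[unused])[:, 1]
    cnt_p[unused] += 1

avg_p = sum_p / np.maximum(cnt_p, 1)  # average p-score over both phases
hi_threshold = 0.90
for i in range(180):
  if avg_p[i] > hi_threshold:
    print("item %3d guessed positive (avg p = %0.4f)" % (i, avg_p[i]))
# items at or below the threshold: insufficient evidence to make a guess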



Unlabeled data hides its true underlying nature. During the covid pandemic, face masks hide the true nature of people’s appearance. Here are three covid face masks with questionable functionality. Left: Good filter for an automobile but not so good for a person. Center: Looks good but probably isn’t very effective. Right: I suspect the face mask here was chosen for reasons other than medical ones.


Posted in Machine Learning

A Predict-Next-Word Example Using Hugging Face and GPT-2

Deep neural transformer architecture (TA) systems can be considered the successors to LSTM (long short-term memory) networks. TAs have revolutionized the field of natural language processing (NLP). Unfortunately, TA systems are extremely complicated, and implementing a TA system from scratch can take weeks or months.

The Hugging Face (HF) code library wraps TAs and makes them relatively easy to use.

I’ve been walking through the HF documentation examples. I take an example and then refactor it completely. Doing so forces me to understand every line of code. Over time, by repeating this process for many examples, I expect to gain a solid grasp of the HF library.

My latest experiment was to refactor the example that does a “next-word” prediction. You feed the model a sequence of words and the model predicts the next word. For my demo, I set up a sequence of:

“Machine learning with PyTorch can do amazing . . ”

The built-in model predicted the next word is “things” which seems reasonable.

The documentation example wasn’t very good in my opinion because instead of predicting the single most likely word, the example fetched all possible words (50,257 of them), did some complicated filtering using the HF top_k_top_p_filtering() function, fed those filtered results to the PyTorch multinomial() probability distribution function, and then selected one highly likely, but not necessarily most likely, result. My point is that the documentation example had too many clever bells and whistles which obscured the main ideas of next-word prediction.

Note: The system doesn’t really predict a next “word” — it’s more correct to say the model prediction is a “token”. For example, the tokenizer breaks the word “PyTorch” into “Py”, “Tor”, and “ch” tokens.

Even though the documentation example was short, it is extremely dense. Every statement has many nuances and ideas. Parsing through the documentation example took me a full day, and there are still some details I don’t fully understand. But it was good fun and the adventure took me one step closer to a working knowledge of the HF library for transformer architecture systems.



I used to like to watch the Roadrunner and Coyote cartoons. The Coyote always had a new plan to catch the Roadrunner, and the fun was predicting how the next plan would fail — no transformer architecture needed.


Demo code:

# next_word_test.py

import torch
from transformers import AutoModelForCausalLM, \
  AutoTokenizer

print("\nBegin next-word using HF GPT-2 demo ")

toker = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

seq = "Machine learning with PyTorch can do amazing"
print("\nInput sequence: ")
print(seq)

inpts = toker(seq, return_tensors="pt")
print("\nTokenized input data structure: ")
print(inpts)

inpt_ids = inpts["input_ids"]  # just IDs, no attention mask
print("\nToken IDs and their words: ")
for tok_id in inpt_ids[0]:
  word = toker.decode(tok_id)
  print(tok_id, word)

with torch.no_grad():
  logits = model(**inpts).logits[:, -1, :]
print("\nAll logits for next word: ")
print(logits)
print(logits.shape)

pred_id = torch.argmax(logits).item()
print("\nPredicted token ID of next word: ")
print(pred_id)

pred_word = toker.decode(pred_id)
print("\nPredicted next word for sequence: ")
print(pred_word)

print("\nEnd demo ")
Posted in Machine Learning

NFL 2021 Week 7 Predictions – Zoltar Likes the Raiders to Cover Against the Eagles

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #7 of the 2021 season. It usually takes Zoltar about four weeks to hit his stride and takes humans about eight weeks to get up to speed, so weeks six through nine are usually Zoltar’s sweet spot.

Zoltar:      browns  by    6  dog =     broncos    Vegas:      browns  by    6
Zoltar:     packers  by    8  dog =    redskins    Vegas:     packers  by  7.5
Zoltar:    dolphins  by    4  dog =     falcons    Vegas:    dolphins  by    3
Zoltar:    patriots  by    6  dog =        jets    Vegas:    patriots  by    7
Zoltar:    panthers  by    0  dog =      giants    Vegas:    panthers  by    3
Zoltar:      titans  by    2  dog =      chiefs    Vegas:      chiefs  by    3
Zoltar:      ravens  by    6  dog =     bengals    Vegas:      ravens  by    7
Zoltar:     raiders  by    6  dog =      eagles    Vegas:     raiders  by  2.5
Zoltar:        rams  by   10  dog =       lions    Vegas:        rams  by 13.5
Zoltar:   cardinals  by   10  dog =      texans    Vegas:   cardinals  by 14.5
Zoltar:  buccaneers  by    6  dog =       bears    Vegas:  buccaneers  by   10
Zoltar:       colts  by    0  dog = fortyniners    Vegas: fortyniners  by  3.5
Zoltar:      saints  by    0  dog =    seahawks    Vegas:      saints  by    3

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I use 3.0 points difference but for the first few weeks of the season I go a bit more conservative and use 4.0 points difference as the advice threshold criterion. In middle weeks I sometimes go ultra-aggressive and use a 1.0-point threshold.

Note: Because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is much too strongly biased towards Vegas underdogs. I need to fix this.

For week #7:

1. Zoltar likes the Vegas underdog Titans against the Chiefs.
2. Zoltar likes the Vegas favorite Raiders over the Eagles.
3. Zoltar likes the Vegas underdog Lions against the Rams.
4. Zoltar likes the Vegas underdog Texans against the Cardinals.
5. Zoltar likes the Vegas underdog Bears against the Buccaneers.
6. Zoltar likes the Vegas underdog Colts against the 49ers.

For example, a bet on the underdog Lions against the Rams will pay off if the Lions win by any score, or if the favored Rams win but by less than the point spread of 13.5 points (in other words, win by 13 points or less).

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

In week #6, against the Vegas point spread, Zoltar went 7-5 (using the aggressive 1.0 points as the advice threshold).

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn’t useful except for parlay betting. In week #6, just predicting the winning team, Zoltar went 10-4 which is about average.

In week #6, just predicting the winning team, Vegas — “the wisdom of the crowd” — went 8-6.

Zoltar sometimes predicts a 0-point margin of victory, which means the two teams are evenly matched. There are three such games in week #7. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.



My system is named after the Zoltar fortune teller machine you can find in arcades. That machine is named after the Zoltar machine from the 1988 movie “Big”. And the 1988 Zoltar was named after the “Zoltan” arcade fortune teller from the 1960s. I’ve always been fascinated by electro-mechanical arcade devices. Center: The “Mystic Ray” machine actually wrote out a fortune using a pen. Amazing tech for the time. Right: The “Zodi” machine actually typed out a fortune using a pneumatic powered typewriter. Also amazing.


Posted in Zoltar

The Effects of COVID-19 on Business Collaboration on Pure AI

I contributed to an article titled “The Effects of COVID-19 on Business Collaboration” in the October 2021 edition of the online Pure AI web site. See https://pureai.com/articles/2021/10/04/covid-business-collaboration.aspx.

A group of Microsoft analysts investigated how work collaboration patterns at their company changed due to the ongoing COVID-19 pandemic. The results were published in an online article titled “The Effects of Remote Work on Collaboration Among Information Workers” by L. Yang, D. Holtz, et al. (September 9, 2021) on the Nature Human Behaviour web site.

Data from approximately 60,000 employees during the period December 2019 to June 2020 was collected and examined. Not surprisingly, the data analysis revealed that the shift to remote work caused a reduction in the interconnections between both formal business groups and informal communities. This reduction in communication diversity has potential negative consequences for business creativity and innovation.



The number of between-group bridging ties decreased significantly after the onset of COVID-19 for traditional office workers.


An additional, somewhat surprising finding was that after the start of the COVID-19 pandemic, the total amount of time Microsoft employees spent in meetings actually decreased by approximately 5 percent compared to pre-pandemic levels. The analysts hypothesized that the reduction was possibly due to indirect factors, such as the increased amount of time a parent needed for at-home schooling and childcare.

I was quoted in the article, “The goal of the program I direct is to infuse advanced artificial intelligence and machine learning systems into products and services. These efforts require communicating very complex ideas along with coordinating complicated logistical information. We’re finding that real-time person-to-person communication is critically important for the design of innovative new systems and algorithms, but that email communication is highly effective for managing our logistical activities.”

And, “Many of my colleagues and I agree that unplanned impromptu conversations, often next to the workplace coffee machine, are extremely important for generating new ideas. Remote work largely eliminates this idea-generation channel.”



Business communication is mostly about the exchange of information. Art communication is entirely different. Three vaguely similar paintings in the Impressionist style. Left: By Berthe Morisot (1841-1895). Center: By Konstantin Razumov (b. 1974). Right: By Frederick Carl Frieseke (1874-1939).

Posted in Miscellaneous

Principal Component Analysis (PCA) From Scratch vs. Scikit

A few days ago I coded up a demo of anomaly detection using principal component analysis (PCA) reconstruction error. I implemented the PCA functionality — computation of the transformed data, the principal components, and the variance explained by each component — from semi-scratch, meaning I used the NumPy linalg (linear algebra) library eig() function to compute eigenvalues and eigenvectors.

And it was good.

But in the back of my mind, I was thinking that I should have verified my semi-from-scratch implementation of PCA because PCA is very, very complex and I could have made a mistake.


The from-scratch version (left) and the scikit version (right) are identical except that some of the transformed vectors and principal components differ by a factor of -1. This doesn’t affect anything.

So I took my original from-scratch PCA anomaly detection program and swapped out the PCA implementation from the scikit sklearn.decomposition library. And as expected, the results of the scikit-based PCA program were identical to the results of the from-scratch PCA program. Almost.

My from-scratch code looks like:

import numpy as np

def my_pca(X):
  # returns transformed X, prin components, var explained
  dim = len(X[0])  # n_cols
  means = np.mean(X, axis=0)
  z = X - means  # avoid changing X
  square_m = np.dot(z.T, z)
  (evals, evecs) = np.linalg.eig(square_m) 
  trans_x = np.dot(z, evecs[:,0:dim]) 
  prin_comp = evecs.T  
  v = np.var(trans_x, axis=0, ddof=1) 
  sv = np.sum(v)
  ve = v / sv
  # order everything based on variance explained
  ordering = np.argsort(ve)[::-1]  # sort order high to low
  trans_x = trans_x[:,ordering]
  prin_comp = prin_comp[ordering,:]
  ve = ve[ordering]
  return (trans_x, prin_comp, ve)

X = (load data from somewhere)
(trans_x, p_comp, ve) = my_pca(X)

The scikit-based code looks like:

import numpy as np
import sklearn.decomposition

X = (load data from somewhere)
pca = sklearn.decomposition.PCA().fit(X)
trans_x = pca.transform(X)
p_comp = pca.components_
ve = pca.explained_variance_ratio_

All the results were identical except that the internal transformed X values and the principal components sometimes differed by a factor of -1. As it turns out, this is OK because PCA computes variances, and the sign of the data doesn’t affect variance.
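A two-line check makes the point that flipping signs leaves variance unchanged:

import numpy as np

x = np.array([1.0, 4.0, 2.0, 3.0])
print(np.var(x, ddof=1))   # 1.6667
print(np.var(-x, ddof=1))  # 1.6667 -- sign flip, same variance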

The advantage of using scikit PCA is simplicity. The advantages of using PCA from scratch are 1.) you get fine-tuned control, 2.) you remove an external dependency, 3.) you aren’t using a mysterious black box.

PCA is interesting and sometimes useful, but for tasks like dimensionality reduction and reconstruction, deep neural techniques have largely replaced PCA.



PCA was developed in 1901 by famous statistician Karl Pearson. I wonder if statisticians of that era imagined today’s deep neural technologies. Three images from the movie “Things to Come” (1936) based on the novel of the same name by author H. G. Wells.


Demo code:

# pca_recon_scikit.py
# exactly replicates iris_pca_recon.py scratch version

import numpy as np
import sklearn.decomposition

def reconstructed(X, n_comp, trans_x, p_comp):
  means = np.mean(X, axis=0)
  result = np.dot(trans_x[:,0:n_comp], p_comp[0:n_comp,:])
  result += means
  return result

def recon_error(X, XX):
  diff = X - XX
  diff_sq = diff * diff
  errs = np.sum(diff_sq, axis=1)
  return errs

def main():
  print("\nBegin Iris PCA reconstruction using scikit ")
  np.set_printoptions(formatter={'float': '{: 0.1f}'.format})

  X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],

    [6.4, 3.2, 4.5, 1.5],
    [5.7, 2.8, 4.5, 1.3],

    [7.2, 3.6, 6.1, 2.5],
    [6.9, 3.2, 5.7, 2.3]])

  print("\nSource X: ")
  print(X)

  print("\nPerforming PCA computations ")
  pca = sklearn.decomposition.PCA().fit(X)
  trans_x = pca.transform(X)
  p_comp = pca.components_
  ve = pca.explained_variance_ratio_
  print("Done ")

  print("\nTransformed X: ")
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  print(trans_x)

  print("\nPrincipal components: ")
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  print(p_comp)

  print("\nVariance explained: ")
  np.set_printoptions(formatter={'float': '{: 0.5f}'.format}) 
  print(ve)

  XX = reconstructed(X, 4, trans_x, p_comp)
  print("\nReconstructed X using all components: ")
  np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
  print(XX)

  XX = reconstructed(X, 1, trans_x, p_comp)
  print("\nReconstructed X using one component: ")
  np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
  print(XX)

  re = recon_error(X, XX)
  print("\nReconstruction errors using one component: ")
  np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
  print(re)

  print("\nEnd PCA scikit ")

if __name__ == "__main__":
  main()
Posted in Machine Learning

Ordinal Classification Using PyTorch in Visual Studio Magazine

I wrote an article titled “Ordinal Classification Using PyTorch” in the October 2021 edition of the Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/10/04/ordinal-classification-pytorch.aspx.

The goal of an ordinal classification problem is to predict a discrete value, where the set of possible values is ordered. For example, you might want to predict the price of a house (based on predictors such as area, number of bedrooms and so on) where the possible price values are 0 (low), 1 (medium), 2 (high), 3 (very high). Ordinal classification is different from a standard, non-ordinal classification problem where the set of values to predict is categorical and is not ordered. For example, predicting the exterior color of a car, where 0 = white, 1 = silver, 2 = black and so on, is a standard classification problem.

There are a surprisingly large number of techniques for ordinal classification. My article presents a relatively simple technique that I’ve used with good success. To the best of my knowledge the technique has not been published and does not have a standard name. Briefly, the idea is to programmatically convert ordinal labels such as (0, 1, 2, 3) to floating point targets such as (0.125, 0.375, 0.625, 0.875) and then use a neural network regression approach.
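For example, with four classes the targets above are the midpoints of four equal-width buckets in [0, 1]. Here is a minimal sketch of the mapping in both directions; the bucket-midpoint formula is my reading of the target values, so see the article for the exact scheme.

import numpy as np

k = 4  # number of ordinal classes
labels = np.array([0, 1, 2, 3])
targets = (labels + 0.5) / k  # midpoints of k equal-width buckets in [0,1]
print(targets)                # [0.125 0.375 0.625 0.875]

# map a raw network output back to an ordinal class label
output = 0.7215
pred = min(int(output * k), k - 1)  # 2.886 -> class 2 ("high")
print(pred)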

I explained using a specific demo example. The demo predicted the price of a house (0 = low, 1 = medium, 2 = high, 3 = very high) based on predictor variables: air conditioning (-1 = no, +1 = yes), normalized area in square feet (e.g., 0.2500 = 2,500 sq. feet), style (art_deco, bungalow, colonial) and local school (johnson, kennedy, lincoln).




The demo loaded a 200-item set of training data and a 40-item set of test data into memory. During the loading process, the ordinal class labels (0, 1, 2, 3) were converted to float targets (0.125, 0.375, 0.625, 0.875). The mapping of ordinal labels to float targets is the key to the ordinal classification technique.

The demo created an 8-(10-10)-1 deep neural network. There are 8 input nodes (one for each predictor value after encoding house style and local school), two hidden layers with 10 processing nodes each and a single output node. The neural network emits an output value in the range 0.0 to 1.0, that corresponds to the float targets.
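Here is a minimal PyTorch sketch of a network with that 8-(10-10)-1 shape. The sigmoid on the output node keeps predictions in the 0.0 to 1.0 range of the float targets; the tanh hidden activations are my assumption, so see the article for the exact architecture.

import torch as T

class Net(T.nn.Module):
  # 8-(10-10)-1 network for ordinal regression-style output
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(8, 10)
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 1)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.sigmoid(self.oupt(z))  # output in (0.0, 1.0)
    return z

net = Net()
dummy_x = T.rand(1, 8)  # one item: 8 encoded predictor values
print(net(dummy_x))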

After training, the model achieved a prediction accuracy of 89.5 percent on the training data (179 of 200 correct) and 82.5 percent accuracy on the test data (33 of 40 correct). The demo concluded by making a prediction for a new, previously unseen house. The predictor values were air conditioning = no (-1), area = 0.2300 (2300 square feet), style = colonial (0, 0, 1) and local school = kennedy (0, 1, 0). The raw computed output value was 0.7215 which mapped to class label 2, which corresponded to an ordinal price of “high”.



The novel “Adventures of Huckleberry Finn” was published in 1885. It is almost universally regarded as one of the greatest American novels ever published — an ordinal rating of 5 on a scale of 1-5. Reading the book is required by nearly 100% of the top-rated high schools in the U.S. I read “Huck Finn” in high school and it was easily one of my favorite books.


Posted in PyTorch

Nucleus Sampling for Natural Language Processing

I ran into an interesting idea called nucleus sampling, also called top-p sampling. Nucleus sampling is used for natural language processing (NLP) next-word prediction.

Suppose you have a sentence that starts with “I got up and ran to the . . ” and you have a prediction model that emits next possible words and their associated logits (raw output values from the model). Suppose the model predicts these 7 words, ordered from largest logit (most likely) to smallest logit: door, car, store, race, finish, monkey, apple.

The simplest way to pick the next word is to just select the most likely, “door”. Picking the most likely next word doesn’t work very well because it turns out that when you string several words together in this way, the generated text doesn’t seem human.

A more sophisticated way is to examine the top-k candidates and then randomly select one of those top-k words. For example, if k = 3 the top three next words are door, car, store and you’d use one of these, randomly selected (either uniform random or from a multinomial distribution), as the next word.

The problem with top-k selection is that it’s difficult to pick k. If k is too large, you might include bad candidates. If k is too small, you might exclude good candidates.

For nucleus sampling, you select the top items whose cumulative probability is less than some specified probability threshold, p. So you convert the logits to probabilities using the softmax() function, compute the cumulative probabilities, select the items where the cumulative probability is less than p, and then randomly select one of those candidates.

For example:

next    logits   exp      prob   cum prob  
------------------------------------------
door    -0.54   0.5827   0.2201   0.2201
car     -0.65   0.5220   0.1972   0.4173
store   -0.79   0.4538   0.1714   0.5888
race    -0.98   0.3753   0.1418   0.7306
finish  -1.28   0.2780   0.1050   0.8356
monkey  -1.43   0.2393   0.0904   0.9260
apple   -1.63   0.1959   0.0740   1.0000
                ------   ------
                2.6472   1.0000

I’ve listed the logits from large to small (logits are often, but not always, negative). The exp column is a scratch column for the calculation of the softmax probabilities. Notice that the sum of the prob values is 1.0, as it should be.

With the cumulative probabilities in hand, you can specify a top-p value, such as p = 0.6 and then select those items where the cumulative probability is less than 0.6 which would be “door”, “car”, “store”, and then randomly pick one of these candidates.
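Here is a short NumPy sketch of the whole selection process, using the logits from the table above. One caveat: the Holtzman et al. paper formally defines the nucleus as the smallest set of words whose cumulative probability reaches p, which would also include the first word that crosses the threshold (“race” in this example). The sketch follows the simpler strictly-less-than rule described here.

# nucleus_sampling_sketch.py

import numpy as np

words  = ["door", "car", "store", "race", "finish", "monkey", "apple"]
logits = np.array([-0.54, -0.65, -0.79, -0.98, -1.28, -1.43, -1.63])

probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax
cum_probs = np.cumsum(probs)  # [0.2201 0.4173 0.5888 0.7306 ...]

p = 0.60
keep = cum_probs < p          # strictly-less-than rule from the text
candidates = [w for (w, k) in zip(words, keep) if k]
print(candidates)             # ['door', 'car', 'store']

# randomly pick one candidate, weighted by renormalized probability
cand_probs = probs[keep] / np.sum(probs[keep])
print(np.random.choice(candidates, p=cand_probs))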

Nucleus sampling isn’t magic because you still have to specify the threshold p value, but it’s easier to pick a good p value for nucleus sampling than it is to pick a good k value for top-k sampling.

The research paper which describes nucleus sampling is “The Curious Case of Neural Text Degeneration” by A. Holtzman et al. The paper presents some evidence that nucleus sampling generates more human-like text than top-k sampling.

I initially ran into nucleus sampling while working with the Hugging Face (HF) neural code library for NLP. The HF library has a weird function top_k_top_p_filtering() that combines top-k and top-p (nucleus) sampling.

Filtering logits for natural language — Tricky. Interesting.



Three dresses made from coffee filters. Tricky. Interesting.


Posted in Machine Learning