The Netflix Data Privacy Experiment

One of the frequently cited research papers related to data privacy is “Robust De-Anonymization of Large Sparse Datasets”, A. Narayanan and V. Shmatikov, in Proceedings of the 2008 IEEE Symposium on Security and Privacy, May 2008. The paper examined a dataset of Netflix user movie reviews where personal information, such as user name, had been removed.

I’ve seen summaries of this paper used many times in technical articles and on Web sites, but most of the summaries have minor inaccuracies. Here is my summary of the two key results from the research paper.



1.) When an adversary knows only a little bit of information about a particular record in the anonymized Netflix movie review dataset, the adversary can find the full record. Specifically, when an adversary knows just 8 movie ratings, of which 2 can be completely wrong, and dates that can have a 14-day error, 99% of records can be uniquely identified in the dataset. Note that personal information isn’t revealed because no personal information is in the dataset.

2.) By using IMDB movie review dataset information, which does have personal information supplied by users, it is possible to match IMDB reviews with Netflix reviews and therefore find personal information that was removed from the Netflix dataset.
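The linkage idea behind these two results can be illustrated with a toy scoring function. To be clear, this is not the paper's actual algorithm (the paper uses a weighted statistical scoring method over the full dataset); the movie IDs, ratings, and day numbers below are made up for illustration:

```python
# Toy sketch of record linkage: score each anonymized record against
# an adversary's partial knowledge. A movie "matches" if the rating is
# close and the date is within 14 days (the tolerances from result 1).

def match_score(aux, record, rating_tol=1, date_tol_days=14):
  # aux and record map movie_id -> (rating, day_number)
  score = 0
  for movie_id, (rating, day) in aux.items():
    if movie_id in record:
      r, d = record[movie_id]
      if abs(r - rating) <= rating_tol and abs(d - day) <= date_tol_days:
        score += 1
  return score

aux = {101: (5, 10), 202: (3, 40), 303: (4, 75)}  # adversary's knowledge
records = {
  'rec_A': {101: (5, 12), 202: (3, 38), 303: (4, 80), 404: (1, 90)},
  'rec_B': {101: (2, 200), 505: (5, 30)},
}
best = max(records, key=lambda k: match_score(aux, records[k]))
print(best)  # rec_A -- it matches on all three known movies
```

The paper's contribution is showing that, because the dataset is large and sparse, this kind of score is almost always uniquely maximized by the correct record.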



The identities of the people who created ancient Egyptian art and jewelry will remain anonymous/private forever. Three modern interpretations in film. Left: “Caesar and Cleopatra” (1945). Center: “Gods of Egypt” (2016). Right: “Cleopatra” (1934).


Posted in Machine Learning | Leave a comment

Dealing With PyTorch Training Data That Has IDs

When working with PyTorch (or Keras) neural networks, a surprisingly tricky task is dealing with training data that has IDs. Data IDs are useful when analyzing a model to diagnose items that are incorrectly predicted.

You need to store the ID information, but you don’t want to feed the ID information to the neural network.

There are dozens of design choices but my preferred technique is to design a Dataset object that returns items with three fields: predictor values, target values, and ID values. This is a technique that is best understood by examining a concrete example.

Suppose the training data looks like:

train_0001, 5.1, 3.5, 1.4, 0.2, 0
train_0002, 4.9, 3.0, 1.4, 0.2, 0
. . .
train_0120, 6.9, 3.1, 5.4, 2.1, 2

This is the Iris dataset. The first column is a data ID that I added, the next four columns are predictor values (sepal length and width, petal length and width), and the last column is the species class label (0 = setosa, 1 = versicolor, 2 = virginica).

A PyTorch Dataset definition for the data is:

class IrisDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None):
    # 'train_0001', 5.0, 3.5, 1.3, 0.3, 0
    tmp_all = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,6), delimiter=",", skiprows=0,
      comments="#", dtype=str)                 # IDs are str
    tmp_x = tmp_all[:,1:5].astype(np.float32)  # cols 1,2,3,4
    tmp_y = tmp_all[:,5].astype(np.int64)      # col 5

    self.x_data = T.tensor(tmp_x, dtype=T.float32)
    self.y_data = T.tensor(tmp_y, dtype=T.int64)
    self.id_data = tmp_all[:,0]  # already str

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    spcs = self.y_data[idx] 
    id = self.id_data[idx]  # str
    sample = { 'predictors' : preds, 
               'species' : spcs,
               'id' : id }
    return sample

The data is read into memory as NumPy arrays, and the predictor and label values are converted to PyTorch tensors. A data item is returned as a Dictionary object with keys ‘predictors’, ‘species’, and ‘id’. You could return a data item as a tuple, but using string keys like batch[‘predictors’] and batch[‘id’] is less error-prone than index values like batch[0] and batch[2].

Accessing the data looks like this:

  for (b_ix, batch) in enumerate(dataldr):
    X = batch['predictors'] 
    Y = batch['species']
    id = batch['id']
    with T.no_grad():
      oupt = model(X)  # logits form
    print("ID = ", end=""); print(id)
    print("X = ", end=""); print(X)
    print("Y = ", end=""); print(Y)
    print("oupt = ", end=""); print(oupt)
    . . . 

The ID information is attached to each data item but isn’t fed to the network.
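As a concrete example of the diagnosis scenario, here is a minimal sketch of collecting the IDs of misclassified items. The stand-in model and the two data items below are fabricated for illustration; in the real demo the items would come from the Dataset via a DataLoader:

```python
import torch as T

# stand-in "model": predicts class 0 when the first predictor is small,
# class 2 otherwise (fabricated logic, just for the sketch)
def model(x):
  return T.tensor([[1.0, 0.0, 0.0]]) if x[0,0] < 5.0 \
    else T.tensor([[0.0, 0.0, 1.0]])

items = [
  {'predictors': T.tensor([[4.9, 3.0, 1.4, 0.2]]),
   'species': T.tensor([0]), 'id': 'train_0002'},
  {'predictors': T.tensor([[6.9, 3.1, 5.4, 2.1]]),
   'species': T.tensor([1]), 'id': 'train_0120'},
]

wrong_ids = []
for batch in items:
  with T.no_grad():
    oupt = model(batch['predictors'])
  if T.argmax(oupt).item() != batch['species'].item():
    wrong_ids.append(batch['id'])  # ID kept for diagnosis, never fed to model

print(wrong_ids)  # ['train_0120']
```

The point is that the ID rides along with each item through the whole pipeline, so when a prediction is wrong you know exactly which source row to inspect.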

Learning how to use neural networks is a long journey that has many small conceptual sub-voyages.



The Matson Navigation Company was founded in 1882. The SS Mariposa and SS Monterey passenger ships were launched in 1931 and were famous as the most elegant way to travel to Hawaii and the South Pacific in the days before jet air travel. Reading about Matson ships and seeing them in old movies inspired me to want to work on a cruise ship, which I eventually did after I graduated from college (on the Royal Viking Line as an assistant cruise director).


Demo code:

# iris_ids.py
# iris example dealing with data IDs
# PyTorch 1.9.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10 

import numpy as np
import torch as T
device = T.device("cpu")  # apply to Tensor or Module

# -----------------------------------------------------------

class IrisDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None):
    # 'train_0001', 5.0, 3.5, 1.3, 0.3, 0
    tmp_all = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,6), delimiter=",", skiprows=0,
      comments="#", dtype=str)                 # IDs are str
    tmp_x = tmp_all[:,1:5].astype(np.float32)  # cols 1,2,3,4
    tmp_y = tmp_all[:,5].astype(np.int64)      # col 5

    self.x_data = T.tensor(tmp_x, dtype=T.float32)
    self.y_data = T.tensor(tmp_y, dtype=T.int64)
    self.id_data = tmp_all[:,0]  # already str

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    spcs = self.y_data[idx] 
    id = self.id_data[idx]  # str
    sample = { 'predictors' : preds, 
               'species' : spcs,
               'id' : id }
    return sample

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 7)  # 4-7-3
    self.oupt = T.nn.Linear(7, 3)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = self.oupt(z)  # no softmax: CrossEntropyLoss() 
    return z

# -----------------------------------------------------------

def accuracy(model, dataset):
  # assumes model.eval() mode
  dataldr = T.utils.data.DataLoader(dataset, batch_size=1,
    shuffle=False)
  n_correct = 0; n_wrong = 0
  for (_, batch) in enumerate(dataldr):
    X = batch['predictors'] 
    Y = batch['species']  # already flattened by Dataset
    id = batch['id']
    with T.no_grad():
      oupt = model(X)  # logits form
    print("ID = ", end=""); print(id)
    print("X = ", end=""); print(X)
    print("Y = ", end=""); print(Y)
    print("oupt = ", end=""); print(oupt)
    input()  # pause after each item; press Enter to continue

    big_idx = T.argmax(oupt)
    # if big_idx.item() == Y.item():
    if big_idx == Y:
      n_correct += 1
    else:
      n_wrong += 1

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin PyTorch Iris dataset with IDs demo \n")
  T.manual_seed(1)
  np.random.seed(1)
  
  # 1. create DataLoader objects
  print("Creating Iris train and test DataLoader ")

  train_file = ".\\Data\\iris_train_with_ids.txt"
  test_file = ".\\Data\\iris_test_with_ids.txt"

  train_ds = IrisDataset(train_file, num_rows=120)
  test_ds = IrisDataset(test_file)  # 30 rows 

  bat_size = 4
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)
  test_ldr = T.utils.data.DataLoader(test_ds,
    batch_size=1, shuffle=False)

  # 2. create network
  net = Net().to(device)

  # 3. train model
  max_epochs = 12
  ep_log_interval = 2
  lrn_rate = 0.05

  loss_func = T.nn.CrossEntropyLoss()  # applies softmax()
  opt = T.optim.SGD(net.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_size)
  print("loss = " + str(loss_func))
  print("optimizer = SGD")
  print("max_epochs = %3d " % max_epochs)
  print("lrn_rate = %0.3f " % lrn_rate)

  print("\nStarting training")
  net.train()  # set the mode
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch['predictors']  # [4,4] because bat_size = 4
      Y = batch['species']  # OK; already flattened
      # do not use IDs during training   
      opt.zero_grad()
      oupt = net(X)
      loss_obj = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_obj.item()  # accumulate
      loss_obj.backward()
      opt.step()

    if epoch % ep_log_interval == 0:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))
  print("Done ")

  # 4. evaluate model accuracy
  print("\nComputing accuracy item-by-item \n")
  net.eval()
  acc = accuracy(net, train_ds)  # item-by-item
  print("Accuracy on train data = %0.4f" % acc)

  # 5. make a prediction
  print("\nPredicting species for [6.1, 3.1, 5.1, 1.1]: ")
  unk = np.array([[6.1, 3.1, 5.1, 1.1]], dtype=np.float32)
  unk = T.tensor(unk, dtype=T.float32).to(device) 

  with T.no_grad():
    logits = net(unk).to(device)  # do not sum to 1.0
  probs = T.softmax(logits, dim=1)  # to device 
  T.set_printoptions(precision=4)
  print(probs)

  # 6. save model (state_dict approach)
  print("\nSaving trained model state")
  fn = ".\\Models\\iris_model.pt"
  T.save(net.state_dict(), fn)

  # saved_model = Net()
  # saved_model.load_state_dict(T.load(fn))
  # use saved_model to make prediction(s)

  print("\nEnd Iris with IDs demo")

if __name__ == "__main__":
  main()
Posted in PyTorch | 1 Comment

NFL 2021 Week 11 Predictions – Zoltar Likes Vegas Favorite Chiefs over the Cowboys

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #11 of the 2021 season. It usually takes Zoltar about four weeks to hit his stride and takes humans about eight weeks to get up to speed, so weeks six through nine are usually Zoltar’s sweet spot. After week nine, injuries start having a big effect.

Zoltar:    patriots  by    0  dog =     falcons    Vegas:    patriots  by  3.5
Zoltar:       bills  by    6  dog =       colts    Vegas:       bills  by  7.5
Zoltar:    panthers  by    4  dog =    redskins    Vegas:    panthers  by    2
Zoltar:      ravens  by    0  dog =       bears    Vegas:      ravens  by    6
Zoltar:      browns  by    9  dog =       lions    Vegas:      browns  by   10
Zoltar: fortyniners  by    0  dog =     jaguars    Vegas: fortyniners  by  6.5
Zoltar:     packers  by    2  dog =     vikings    Vegas:     packers  by  2.5
Zoltar:    dolphins  by    0  dog =        jets    Vegas:    dolphins  by  2.5
Zoltar:      titans  by   11  dog =      texans    Vegas:      titans  by   10
Zoltar:      saints  by    0  dog =      eagles    Vegas:      saints  by    1
Zoltar:     raiders  by    5  dog =     bengals    Vegas:     bengals  by  0.5
Zoltar:      chiefs  by    6  dog =     cowboys    Vegas:      chiefs  by  2.5
Zoltar:   cardinals  by    0  dog =    seahawks    Vegas:    seahawks  by  1.5
Zoltar:    steelers  by    0  dog =    chargers    Vegas:    chargers  by  3.5
Zoltar:  buccaneers  by    7  dog =      giants    Vegas:  buccaneers  by 12.5

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I usually use 3.0 points difference but for the first few weeks of the season I go a bit more conservative and use 4.0 points difference as the advice threshold criterion. In middle weeks I sometimes go ultra-aggressive and use a 1.0-point threshold.
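The advice rule can be sketched as a one-line check. The margins and lines below are made up, and a real comparison also has to handle games where Zoltar and Vegas favor different teams:

```python
# Sketch of the advice rule: flag a game when Zoltar's predicted
# margin differs from the Vegas line by at least the threshold.

def advice(zoltar_margin, vegas_line, threshold=3.0):
  return abs(zoltar_margin - vegas_line) >= threshold

print(advice(5.0, 0.5))   # True: 4.5-point disagreement
print(advice(6.0, 7.5))   # False: only 1.5 points apart
```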

Note: Because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is much too strongly biased towards Vegas underdogs. I need to fix this.

For week #11:

1. Zoltar likes Vegas underdog Falcons against the Patriots
2. Zoltar likes Vegas underdog Bears against the Ravens
3. Zoltar likes Vegas underdog Jaguars against the 49ers
4. Zoltar likes Vegas underdog Bengals against the Raiders
5. Zoltar likes Vegas favorites Chiefs over the Cowboys
6. Zoltar likes Vegas underdog Steelers against the Chargers
7. Zoltar likes Vegas underdog Giants against the Buccaneers

For example, a bet on the underdog Falcons against the Patriots will pay off if the Falcons win by any score, or if the favored Patriots win but by less than the point spread of 3.5 points (in other words, by 3 points or less).

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
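The 53% figure comes from a simple breakeven calculation: a winning $110 bet returns $100 profit and a losing bet costs $110, so expected profit is zero when p*100 - (1-p)*110 = 0.

```python
# Breakeven win probability for -110 odds: p*100 = (1-p)*110
p_breakeven = 110 / (110 + 100)
print("%0.4f" % p_breakeven)  # 0.5238 -- so ~53% accuracy just breaks even
```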

In week #10, against the Vegas point spread, Zoltar went 3-2 (using the standard 3.0 points as the advice threshold). Overall, for the season, Zoltar is 38-29 against the spread (~56%).

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn’t useful except for parlay betting. In week #10, just predicting the winning team, Zoltar went 7-6 which is very poor — there were many upsets in week #10.

In week #10, just predicting the winning team, Vegas — “the wisdom of the crowd” — also went 7-6 which is also terrible.

Zoltar sometimes predicts a 0-point margin of victory, which means the two teams are evenly matched. There are seven such games in week #11. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.



My system is named after the Zoltar fortune teller machine that you can find in arcades. Arcade Zoltar is named after the machine from the 1988 movie “Big” where it grants the wish of a boy to become an adult. And movie Zoltar was named after an old arcade machine named Zoltan.

Posted in Zoltar | Leave a comment

How to Create a Transformer Architecture Model for Natural Language Processing in Visual Studio Magazine

I wrote an article titled “How to Create a Transformer Architecture Model for Natural Language Processing” in the November 2021 edition of the online Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/11/03/transformer-architecture-model.aspx.

My article explains how to create a transformer architecture model for natural language processing. Specifically, the article shows how to create a model that accepts a sequence of words such as “The man ran through the {blank} door” and then predicts most-likely words to fill in the blank.

Transformer architecture (TA) models such as BERT (bidirectional encoder representations from transformers) and GPT (generative pretrained transformer) have revolutionized natural language processing (NLP). But TA systems are extremely complex, and implementing them from scratch can take hundreds or thousands of man-hours. The Hugging Face (HF) library is open source code that has pretrained TA models and an API set for working with the models. The HF library makes implementing NLP systems using TA models much less difficult.

The demo program begins by loading a pretrained DistilBERT language model into memory. DistilBERT is a condensed version of the huge BERT language model. The source sentence is passed to a Tokenizer object which breaks the sentence into words/tokens and assigns an integer ID to each token. For example, one of the tokens is “man” and its ID is 1299, and the token that represents the blank-word is [MASK] and its ID is 103.

The token IDs are passed to the DistilBERT model and the model computes the likelihoods of 28,996 possible words/tokens to fill in the blank. The top five candidates to fill in the blank for “The man ran through the {blank} door” are: “front,” “bathroom,” “kitchen,” “back” and “garage.”
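The fill-in-the-blank query can be sketched with the Hugging Face pipeline API. The exact model name and loading code here are my assumptions, not necessarily what the article uses (the article loads the tokenizer and model explicitly):

```python
# Sketch: fill-in-the-blank with a pretrained DistilBERT masked
# language model via the Hugging Face pipeline API.
from transformers import pipeline

fill = pipeline("fill-mask", model="distilbert-base-cased")
results = fill("The man ran through the [MASK] door.")  # top 5 by default
for r in results:
  print(r["token_str"], "%0.4f" % r["score"])
```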

One way to think about the fill-in-the-blank example presented in this article is that the DistilBERT model gives you an English language expert. You can ask this expert things such as what is the missing word in a sentence, or how similar two words are. But the DistilBERT expert doesn’t have specific knowledge about anything beyond pure English. For example, the basic DistilBERT model doesn’t know anything about movies. It is possible to start with a basic DistilBERT model and then fine-tune the model to give it knowledge about movie reviews in order to create a movie review expert. The fine-tuned expert will know about English and also about the difference between a good movie review and a bad review.



Artificial intelligence has come a long way, but it will be quite some time until AI can understand photos like these ones. Left: This criminal has his hands full of trouble. Center: This criminal is on the espresso lane to jail. Right: Oopsie loompa.


Posted in Machine Learning | Leave a comment

The Difference Between a Probability Density Function Value and Area

I was giving a lecture at the tech company I work for and there was a question from one of the attendees about the probability density function (PDF) for a Gaussian (aka Normal, bell-shaped) distribution. Briefly, the area under the PDF between two x values is the probability that a randomly generated x will be between those two values. For example, for a Gaussian with mean = 0 and standard deviation = 1, the probability that a randomly generated x is between 0.0 and 1.0 is the area under the curve between 0.0 and 1.0 which is approximately 0.3413.

The PDF value at x = 1.0 is approximately 0.2420. A PDF value can be used to compare the relative likelihoods of two different x values. For example, the PDF at x = 2.0 is about 0.0540 so getting x = 1.0 is more likely than getting x = 2.0. PDF values are not probabilities.

The total area under a Gaussian distribution is 1.0 but a PDF value can be greater than 1.0 if the distribution is squished, meaning it has a very small standard deviation.

In machine learning, probably the most common task related to probability distributions is to generate x values from a Gaussian distribution. Computing a PDF value is less common and can be easily done using a program-defined function or the scipy norm.pdf() function. To compute the area under the curve between two values (that is, the probability x is between two values), you can use the scipy norm.cdf() function (cumulative distribution function).
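The three ideas (height at a point, area between two points, and a height greater than 1.0 for a squished distribution) can be checked with a few scipy calls:

```python
from scipy.stats import norm

height = norm.pdf(1.0, loc=0.0, scale=1.0)    # curve height at x = 1.0
area = norm.cdf(1.0) - norm.cdf(0.0)          # P(0.0 < x < 1.0)
squished = norm.pdf(0.0, loc=0.0, scale=0.1)  # small sd -> height over 1.0

print("%0.4f" % height)    # 0.2420 -- a relative likelihood, not a probability
print("%0.4f" % area)      # 0.3413 -- a probability
print("%0.4f" % squished)  # 3.9894 -- PDF values can exceed 1.0
```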



The Gaussian distribution is also known as the Normal distribution because, well, it’s mathematically normal. Two un-normal math photos. Left: Teaching students about angles at a U.S. high school. Explains a lot. Right: The concept of infinity that’s not so infinite.


Demo code:

# gaussian_pdf_demo.py

import numpy as np
from scipy.stats import norm

def my_pdf(x, u, sd):
  a = np.exp(-((x - u)**2) / (2 * sd**2))
  b = sd * np.sqrt(2 * np.pi)
  return a / b

print("\nBegin Gaussian pdf() demo ")
np.random.seed(1)

print("\nSampling 5 values from N(0,1) ")
for i in range(5):
  x = np.random.normal(loc=0.0, scale=1.0)
  print("x = %8.4f " % x)

print("\nComputing pdf() for x = 1.0 ")
y = norm.pdf(x=1.0, loc=0.0, scale=1.0)
print("%8.4f " % y)

y = my_pdf(x=1.0, u=0.0, sd=1.0)
print("%8.4f " % y)

print("\nEnd demo ")
Posted in Machine Learning | Leave a comment

Why Allowing Multiple Queries to a Dataset Weakens Differential Privacy

Differential privacy is a moderately complex security topic. Briefly, and loosely, if you have a dataset (such as Census data) you don’t want queries such as “What is the average age of people in the dataset?” to unintentionally reveal information about a specific person in the dataset.

One of the main ways to prevent security leakage is to add random noise to query results. For queries that return a numeric result, a common technique is to add a random value drawn from the Laplace distribution. (See my post at https://jamesmccaffrey.wordpress.com/2021/11/05/understanding-the-laplace-distribution-for-differential-privacy for an explanation). The idea is that the return result won’t be completely accurate but in many situations the approximate result is good enough to be useful.

However, if you allow users to repeatedly query a dataset, then with enough queries a user can determine the true result, and the true result can potentially be used to reveal sensitive information. If many queries are issued, some of the noisy results will be greater than the true value and some of the noisy results will be less than the true value, but the average of the query results will approach the true value.

I coded up a quick demo. I set up an arbitrary true dataset value of 33. For queries, I returned the true value plus Laplace noise with loc (mean) = 0 and scale (spread) = 3. For 100 queries, most of the returned results were more than 1 away from the true value of 33. But the average of the query results was 32.62 — within 0.38 of the true value.

The moral of the story is that security is tricky and failure can have bad consequences.



Dog failure has fewer consequences than computer security failure.


Demo code:

# diff_priv_multiple_queries.py

import numpy as np

print("\nBegin multiple queries demo ")
np.random.seed(0)
print("\nSetting true dataset query result = 33 ")
print("Setting Laplace noise loc = 0.0, scale = 3.0 \n")

true_result = 33
sum_query_results = 0.0

for i in range(100):
  noise = np.random.laplace(loc=0.0, scale=3.0)
  query_result = true_result + noise

  if i % 10 == 0:
    print("query # %3d " % i, end="")
    print("query result = %6.2f " % query_result, end="")

    if np.abs(true_result - query_result) < 1.0:
      print("within 1.0 is TRUE ")
    else:
      print("within 1.0 is FALSE ")

  sum_query_results += query_result

avg_query_result = sum_query_results / 100  # 100 queries
print("\navg_query result = %6.2f " % avg_query_result)
if np.abs(true_result - avg_query_result) < 1.0:
  print("avg_query_result within 1.0 of true result is TRUE ")
else:
  print("avg_query_result within 1.0 of true result is FALSE ")

print("\nEnd demo ")
Posted in Machine Learning | Leave a comment

Positive and Unlabeled Learning: How Complex is Too Complex?

One of my ongoing projects is to design an improved algorithm for PUL (positive and unlabeled learning). The problem scenario is that you have some data where the class label to predict is class 1 = positive, and other data that is unlabeled, meaning it could be either class 0 = negative or class 1 = positive. The goal is to analyze the unlabeled data and guess whether each item is class 0 or class 1.

Medical data is often PUL — a few patients have a disease, but many thousands of patients are unlabeled.

I’ve designed a neural-based PUL system that seems to work very well . . . sort of. The problem is that the system I designed is very complex because it has dozens of hyperparameters. Examples include neural architecture (number of layers, activations, etc.), neural training (batch size, learning rate, etc.), and dozens of PUL system design choices.

Based on my years of experience, my PUL system, as it stands now, is interesting and possibly useful from a research / theoretical perspective, but the system is less useful from a practical perspective. I’ve worked in several software production environments, and in many situations system simplicity is more important than a small increase in performance. Put somewhat differently, this PUL system might be useful for one-off data analysis and experimentation but not as useful as a black box system.

My demo data looks like this:

# patients_positive.txt
1    0.24   1   0   0   0.2950   0   0   1   1
1    0.45   0   1   0   0.5410   0   1   0   1
1    0.55   0   0   1   0.6460   1   0   0   1
. . . 

# patients_unlabeled.txt
-9   0.39   0   0   1   0.5120   0   1   0   0   
-9   0.36   1   0   0   0.4450   0   1   0   0   
-9   0.50   0   1   0   0.5650   0   1   0   1   
. . .

The data is synthetic. The first column holds a label indicating if the patient has a disease, where -9 indicates unlabeled and 1 indicates positive. The next columns are predictor variables. The last column holds the true class label, 0 or 1, so I can evaluate the accuracy of the PUL system. There are 20 positive data items and 180 unlabeled items.



The output of the system is a pair of probabilities for each unlabeled data item, for example [0.123, 0.877], where the first value is the probability of class 0 and the second value is the probability of class 1. The system uses a delta threshold: only those items where the difference between prob(0) and prob(1) is greater than the delta are used to make predictions. For example, if the threshold is 0.50 then a result like [0.20, 0.80] is used (prediction is class 1) but a result like [0.45, 0.55] isn’t used because the probabilities are too close together.
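The delta-threshold rule can be sketched as a small function (the function name and abstain convention are mine, not part of the actual system):

```python
# Sketch of a delta-threshold decision rule for PUL output:
# only commit to a prediction when the two probabilities differ enough.

def pul_predict(probs, delta=0.50):
  # probs is [p_class0, p_class1]; return 0, 1, or None (abstain)
  if abs(probs[0] - probs[1]) <= delta:
    return None  # probabilities too close together -- don't predict
  return 0 if probs[0] > probs[1] else 1

print(pul_predict([0.20, 0.80]))  # 1
print(pul_predict([0.45, 0.55]))  # None (abstain)
```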

My neural system achieves 85% accuracy using a threshold = 0.50, but with such a large threshold only 34 of the 180 unlabeled data items are predicted (29 correct, 5 wrong). A smaller threshold makes more predictions but with lower accuracy.

One of my colleagues, Alexandra S., pointed out that in PUL systems it’s often important to have a human in the loop. In other words, for the synthetic patient data of my demo, those unlabeled items that are marked as class 1 should not be automatically assumed to be class 1 with absolute certainty — the items should be thought of as possibly class 1 and then examined closely by a human.



Unlabeled data hides its true identity. Masks do the same for people. The Venice Carnival runs roughly the two weeks before Lent — the 40 days preceding Easter — and has featured beautiful masks and costumes since the 12th century. For these masks, more complexity is more appealing (to me anyway).


Posted in Machine Learning | Leave a comment

NFL 2021 Week 10 Predictions – Zoltar Has Five Highly Questionable Suggestions

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #10 of the 2021 season. It usually takes Zoltar about four weeks to hit his stride and takes humans about eight weeks to get up to speed, so weeks six through nine are usually Zoltar’s sweet spot. After week nine, injuries start having a big effect.

Zoltar:      ravens  by    2  dog =    dolphins    Vegas:      ravens  by  6.5
Zoltar:       colts  by    8  dog =     jaguars    Vegas:       colts  by 10.5
Zoltar:     cowboys  by    6  dog =     falcons    Vegas:     cowboys  by  3.5
Zoltar:      browns  by    0  dog =    patriots    Vegas:    patriots  by    3
Zoltar:       bills  by    4  dog =        jets    Vegas:       bills  by 13.5
Zoltar:      titans  by    6  dog =      saints    Vegas:      titans  by  2.5
Zoltar:    steelers  by   11  dog =       lions    Vegas:    steelers  by  9.5
Zoltar:  buccaneers  by    3  dog =    redskins    Vegas:  buccaneers  by  7.5
Zoltar:   cardinals  by    9  dog =    panthers    Vegas:   cardinals  by   10
Zoltar:    chargers  by    6  dog =     vikings    Vegas:    chargers  by    3
Zoltar:     broncos  by    6  dog =      eagles    Vegas:     broncos  by  1.5
Zoltar:     packers  by    6  dog =    seahawks    Vegas:     packers  by    5
Zoltar:      chiefs  by    0  dog =     raiders    Vegas:      chiefs  by    3
Zoltar:        rams  by    0  dog = fortyniners    Vegas:        rams  by    3

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I usually use 3.0 points difference but for the first few weeks of the season I go a bit more conservative and use 4.0 points difference as the advice threshold criterion. In middle weeks I sometimes go ultra-aggressive and use a 1.0-point threshold.

Note: Because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is much too strongly biased towards Vegas underdogs. I need to fix this.

For week #10:

1. Zoltar likes Vegas underdog Dolphins against the Ravens.
2. Zoltar likes Vegas underdog Jets against the Bills.
3. Zoltar likes Vegas favorite Titans over the Saints.
4. Zoltar likes Vegas underdog Redskins against the Buccaneers.
5. Zoltar likes Vegas favorite Broncos over the Eagles.

For example, a bet on the underdog Dolphins against the Ravens will pay off if the Dolphins win by any score, or if the favored Ravens win but by less than the point spread of 6.5 points (in other words, by 6 points or less).

These predictions are really sketchy – Vegas underdogs Dolphins, Jets, Redskins looked terrible in week #9.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

In week #9, against the Vegas point spread, Zoltar went 2-1 (using the standard 3.0 points as the advice threshold). Overall, for the season, Zoltar is 35-27 against the spread (56%).

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn’t useful except for parlay betting. In week #9, just predicting the winning team, Zoltar went 8-6 which is very poor — a lot of upsets in week #9.

In week #9, just predicting the winning team, Vegas — “the wisdom of the crowd” — also went 8-6 which is also terrible.

Zoltar sometimes predicts a 0-point margin of victory, which means the two teams are evenly matched. There are three such games in week #10. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.



My system is named after the Zoltar fortune teller machine you can find in arcades. Zoltar (the machine) uses a crystal ball to make his predictions. Many movies of the 1920s and 1930s featured crystal balls — a lot of people in that era truly believed in spiritualism. Left: “The Black Watch” (1929) is about the British Army in India. Center: “Sinister Hands” (1932) is a classic murder mystery. Right: “The Black Camel” (1931) is a mystery featuring detective Charlie Chan, along with Bela Lugosi as a fortune teller.

Posted in Zoltar | Leave a comment

Using a Hugging Face Fine-Tuned Binary Classification Model

I’ve been taking a deep dive into the Hugging Face (HF) open-source code library for natural language processing (NLP) with a transformer architecture (TA) model.

In previous explorations, I fine-tuned a pretrained HF DistilBERT model (66 million parameters) to classify movie reviews as 0 (negative) or 1 (positive), and I also wrote a function to compute the classification accuracy of the tuned model.

Today I coded up a demo that uses the tuned model to predict the sentiment of an arbitrary new movie review. To do so I had to specify a review using raw text (“This was a GREAT waste of my time.”), then convert the review text to token IDs (integers like “this” = 2023), then feed the tokenized review to the tuned model and fetch the results, then interpret the results.

The pretrained model knows all about the English language, such as the words “movie” and “flick” mean the same thing when the context is cinema, but “flick” can mean a sudden sharp movement in other contexts. But the pretrained model doesn’t know anything about movie review sentiment so the pretrained model must be fine-tuned to understand things like “flop” means a bad movie.

Each of the steps was conceptually simple but had many technical details to deal with. But after a few hours of work I got a demo up and running. Many of the technical problems that I ran into caused me less trouble than expected because I’d seen many similar types of problems while working with literally hundreds of PyTorch models over the past 4 years. This is one of the main values of experience — you can solve problems much more quickly.

It was a very interesting exploration and I can say that I have a good grasp of using a fine-tuned HF classification model. My next set of experiments will try to create an autoencoder model based on a pretrained HF model. I have no idea where to start but I’m sure I’ll figure things out . . . eventually.



Many of the biggest movie box office money losers have been science fiction films. Here are three such movies that collectively lost hundreds of millions of dollars. In my opinion, all three are OK but not quite good — they needed fine-tuning.

Left: “John Carter” (2012) lost over $200 million. The actor came across as an idiot, the actress came across as an annoying harpy, the plot was hard to follow, and the dialogue/sound was nearly impossible to understand without subtitles.

Center: “A Sound of Thunder” (2005) lost about $100 million. The production abruptly ran out of money and the editing suffered greatly.

Right: “Valerian and the City of a Thousand Planets” (2017) lost about $100 million. Incomprehensible choice of lead actor and actress. The lead actor came across as an effeminate wimp and the actress came across as a masculine bully.


Demo code:

# imdb_hf_03_use.py
# use tuned HF model for IMDB sentiment analysis accuracy
# zipped raw data at:
# https://ai.stanford.edu/~amaas/data/sentiment/

import numpy as np  # used only for random seed
from transformers import DistilBertTokenizerFast
import torch
from transformers import DistilBertForSequenceClassification
from transformers import logging  # to suppress warnings

device = torch.device('cpu')

def main():
  # 0. get ready
  print("\nBegin use IMDB HF model demo ")
  logging.set_verbosity_error()  # suppress wordy warnings
  torch.manual_seed(1)
  np.random.seed(1)

  # 1. load pretrained model
  print("\nLoading untuned DistilBERT model ")
  model = \
    DistilBertForSequenceClassification.from_pretrained( \
    'distilbert-base-uncased')
  model.to(device)
  print("Done ")

  # 2. load tuned model wts and biases
  print("\nLoading tuned model wts and biases ")
  model.load_state_dict(torch.load(\
    ".\\Models\\imdb_state.pt"))
  model.eval()
  print("Done ")

  # 3. set up input review
  review_text = ["This was a GREAT waste of my time."]
  print("\nreview_text = ")
  print(review_text)

  tokenizer = \
    DistilBertTokenizerFast.from_pretrained(\
    'distilbert-base-uncased')
  review_tokenized = \
    tokenizer(review_text, truncation=True, padding=True)
  
  print("\nreview_tokenized = ")
  print(review_tokenized)
  # {'input_ids': [[101, 2023, 2001, 1037, 2307, 5949,
  #    1997, 2026, 2051, 1012, 102]],
  #  'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

  input_ids = review_tokenized['input_ids']
  print("\nTokens: ")
  for tok_id in input_ids[0]:  # avoid shadowing built-in id()
    tok = tokenizer.decode(tok_id)
    print("%6d %s " % (tok_id, tok))

  input_ids = torch.tensor(input_ids).to(device)
  mask = torch.tensor(review_tokenized['attention_mask']).\
    to(device)
  dummy_label = torch.tensor([0]).to(device)

  # 4. feed review to model, fetch result
  with torch.no_grad():
    outputs = model(input_ids, \
      attention_mask=mask, labels=dummy_label)
  print("\noutputs = ")
  print(outputs)
  # SequenceClassifierOutput(
  # loss=tensor(0.1055),
  # logits=tensor([[ 0.9256, -1.2700]]),
  # hidden_states=None,
  # attentions=None)

  # 5. interpret result
  logits = outputs[1]
  print("\nlogits = ")
  print(logits)

  pred_class = torch.argmax(logits, dim=1)
  print("\npred_class = ")
  print(pred_class)

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
Posted in PyTorch | Leave a comment

An Example of Locality-Sensitive Hashing

I was working with differential privacy recently and the topic of locality-sensitive hashing (LSH) came up. The Wikipedia definition is: “Locality-sensitive hashing is an algorithmic technique that hashes similar input items into the same buckets with high probability.”

Put another way, an LSH function accepts any kind of input (a numeric vector, a string, a text file, an image, etc.) and returns a single integer (bucket) value in such a way that similar input items return the same bucket value.

Even though it’s possible to write a generic LSH function that handles any kind of input (which requires the raw input to be converted to bytes), LSH functions are often program-defined and specific to different problem scenarios.

Here’s a concrete example of a custom LSH function. Suppose the inputs are numeric 2D vectors where each element is between 0.0 and 10.0, for example [2.0, 3.0] or [6.5, 0.4]. The maximum distance from any input vector to the origin at [0, 0] is sqrt(10^2 + 10^2) = sqrt(200) = 14.1421. And suppose you specify 3 buckets (0, 1, or 2). Define an LSH function as: if the computed distance is in [0.0, 5.0) return bucket 0, if in [5.0, 10.0) return bucket 1, if in [10.0, 15.0) return bucket 2.

With this design inputs of [0.0, 1.0] and [1.5, 1.5] both return bucket 0.
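The three-bucket scheme can be sketched directly. This is a minimal sketch of the design just described, separate from the generalized n-bucket demo code later in this post:

```python
import numpy as np

def lsh_bucket_3(x):
  # x is a 2D vector where each element is in [0.0, 10.0]
  # so distance to the origin is in [0.0, 14.1421]
  dist = np.sqrt(x[0]**2 + x[1]**2)  # Euclidean distance to [0, 0]
  if dist < 5.0: return 0
  if dist < 10.0: return 1
  return 2

print(lsh_bucket_3(np.array([0.0, 1.0])))  # 0
print(lsh_bucket_3(np.array([1.5, 1.5])))  # 0 -- same bucket as above
print(lsh_bucket_3(np.array([6.5, 0.4])))  # 1
```

Notice that nearby inputs like [0.0, 1.0] and [1.5, 1.5] hash to the same bucket, which is exactly the property an ordinary hash function tries to avoid.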

If you think about it, LSH can be thought of as a clustering algorithm where the bucket number is synonymous with cluster ID.
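The clustering view can be made concrete by grouping a set of vectors by their bucket value. A minimal sketch, assuming the same three-buckets-of-width-5.0 distance scheme described above; the sample vectors are made up for illustration:

```python
import numpy as np
from collections import defaultdict

data = [[0.0, 1.0], [1.5, 1.5], [6.5, 0.4], [9.0, 8.0]]
clusters = defaultdict(list)  # bucket number -> list of vectors
for v in data:
  dist = np.sqrt(v[0]**2 + v[1]**2)  # Euclidean distance to origin
  bucket = min(int(dist / 5.0), 2)   # 3 buckets of width 5.0
  clusters[bucket].append(v)

print(dict(clusters))
# {0: [[0.0, 1.0], [1.5, 1.5]], 1: [[6.5, 0.4]], 2: [[9.0, 8.0]]}
```

Each bucket acts like a cluster ID: items that land in the same bucket are (with high probability) near each other.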

Locality-sensitive hashing is often used with text data, which is much more difficult to work with than numeric data because comparing two pieces of text is harder than comparing two numeric vectors. But the general principles are the same.



Last night I watched the 2021 movie version of “Dune” based on the 1965 novel by Frank Herbert. A cluster of three book covers. Left: Hardcover first edition (1965). Center: Paperback (1967). Right: Paperback (1984).


Demo code:

# lsh_demo.py

# Wikipedia: locality-sensitive hashing (LSH) is an
# algorithmic technique that hashes similar input items
# into the same "buckets" with high probability

import numpy as np

def lsh_bucket(x, n):
  # x is a 2D vector where each element is in [0.0, 10.0]
  # return is in [0, n-1] -- n buckets
  dist = np.sqrt(x[0]**2 + x[1]**2)  # Euclidean to [0,0]
  # max dist = sqrt(10^2 + 10^2) = sqrt(200) = 14.1421
  delta = 15.0 / n
  for i in range(n):
    if dist < delta * (i+1):
      return i
  return n-1  # safety net; not reached because max dist = 14.1421 < 15.0
   
print("\nBegin locality-sensitive hashing (LSH) demo \n")

x = np.array([0.0, 0.0])
bucket = lsh_bucket(x, 4)
print("x = ", end=""); print(x)
print("bucket = %d \n" % bucket)

x = np.array([4.0, 5.0])
bucket = lsh_bucket(x, 4)
print("x = ", end=""); print(x)
print("bucket = %d \n" % bucket)

x = np.array([10.0, 10.0])
bucket = lsh_bucket(x, 4)
print("x = ", end=""); print(x)
print("bucket = %d \n" % bucket)

print("End demo ")
Posted in Machine Learning | Leave a comment