Why I Don’t Use Neural Network Early Stopping

I was chit-chatting with a work colleague about early stopping. We were next to our workplace coffee-machine / engineer-fueling-station. I almost never use early stopping for neural network training. Briefly, 1.) it’s not really possible to determine when to stop, and 2.) the gain from early stopping doesn’t outweigh the downside cost of having less training data due to the need for validation data.

The graph on the left shows early stopping in theory. Instead of dividing your labeled data into a train (maybe 80% of the data) and a test (20%) dataset as usual, you divide into train (perhaps 70%), validate (15%), and test (15%) datasets. During training, every so often you evaluate the network error/loss on the training and validation data. At some point the validation error will start to increase, indicating that overfitting has started to occur, and that's the signal to stop training.
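For concreteness, here is a minimal sketch of what validation-based early stopping with a patience counter might look like in PyTorch. The network, loss function, DataLoader, and the patience value are all hypothetical placeholders, not code from an actual demo.

import torch as T
# assumes net, loss_func, optimizer, max_epochs, train_ldr (training
# DataLoader), and X_valid, Y_valid tensors have already been created
best_val_loss = float('inf')
patience = 20  # epochs to wait after the last improvement (hypothetical value)
wait = 0

for epoch in range(max_epochs):
  net.train()
  for (X, Y) in train_ldr:
    optimizer.zero_grad()
    loss_val = loss_func(net(X), Y)
    loss_val.backward()
    optimizer.step()

  net.eval()
  with T.no_grad():
    val_loss = loss_func(net(X_valid), Y_valid).item()

  if val_loss < best_val_loss:   # validation improved
    best_val_loss = val_loss
    wait = 0
    T.save(net.state_dict(), "best_model.pt")  # remember best weights
  else:
    wait += 1
    if wait >= patience:         # no improvement for 'patience' epochs
      print("Stopping early at epoch %d" % epoch)
      break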

Unfortunately, the nice smooth, easy-to-interpret graph on the left almost never happens. The graph on the right shows validation error on an actual training run (I found the graph on the Internet). The graph jumps around and if you stopped training around epoch 50, when validation error starts to increase, you’d be early stopping too early.

There are some scenarios where a conservative form of early stopping is useful. For example, if you are doing programmatic hyperparameter optimization/tuning, a bad early-stop doesn’t hurt too much because you’ll likely get another chance so to speak.



I don’t know much about art, but I suspect that an artist has a difficult time knowing when to stop work on a piece of art. Here are three left-facing portraits by artists whose work I admire a lot. Left: By Hans Jochem Bakker. Center: By Randy Monteith. Right: By Daniel Arrhakis.


Posted in Machine Learning | Leave a comment

NFL 2022 Week 10 Predictions – Zoltar Thinks Underdogs Packers and Steelers Will Win Outright

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #10 of the 2022 season.

Zoltar:     falcons  by    0  dog =    panthers    Vegas:     falcons  by    3
Zoltar:  buccaneers  by    3  dog =    seahawks    Vegas:  buccaneers  by  2.5
Zoltar:       bills  by    6  dog =     vikings    Vegas:       bills  by    6
Zoltar:       bears  by    6  dog =       lions    Vegas:       bears  by    3
Zoltar:      chiefs  by   10  dog =     jaguars    Vegas:      chiefs  by  9.5
Zoltar:    dolphins  by    4  dog =      browns    Vegas:    dolphins  by    4
Zoltar:      giants  by    2  dog =      texans    Vegas:      giants  by  6.5
Zoltar:      titans  by    6  dog =     broncos    Vegas:      titans  by    3
Zoltar:    steelers  by    4  dog =      saints    Vegas:      saints  by  2.5
Zoltar:     raiders  by    5  dog =       colts    Vegas:     raiders  by    6
Zoltar:     packers  by    4  dog =     cowboys    Vegas:     cowboys  by  5.5
Zoltar:        rams  by    5  dog =   cardinals    Vegas:        rams  by    3
Zoltar: fortyniners  by    5  dog =    chargers    Vegas: fortyniners  by    7
Zoltar:      eagles  by    5  dog =  commanders    Vegas:      eagles  by 10.5

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. For this season I’ve been using a threshold of 4 points difference but in some previous seasons I used 3 points.

At the beginning of the season, because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is very strongly biased towards Vegas underdogs. I probably need to fix this. For week #10 Zoltar likes four Vegas underdogs:

1. Zoltar likes Vegas underdog Texans against the Giants.
2. Zoltar likes Vegas underdog Steelers against the Saints.
3. Zoltar likes Vegas underdog Packers against the Cowboys.
4. Zoltar likes Vegas underdog Commanders against the Eagles.

For example, a bet on the underdog Texans against the Giants will pay off if the Texans win by any score, or if the favored Giants win but by less than 6.5 points (i.e., 6 points or less). If a favored team wins by exactly the point spread, the wager is a push. This is why point spreads often have a 0.5 added — called “the hook” — to eliminate pushes.

This is the part of the season where injuries start having a big effect on the point spread. I’m surprised that Zoltar thinks the Green Bay Packers are better than the Dallas Cowboys. To the human eye, the Packers have looked terrible and the Cowboys look like champion contenders. We’ll see.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
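The 53 percent figure comes from simple break-even arithmetic: a $110 wager that wins returns $100 in profit, so the break-even win probability p satisfies 100 * p = 110 * (1 - p), which gives p = 110 / 210, or about 0.524.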

In week #9, against the Vegas point spread, Zoltar went 5-1 (using 4.0 points as the advice threshold). Quite lucky because several big Vegas favorites won easily but didn’t quite cover the spread.

For the season, against the spread, Zoltar is 33-16 (~67% accuracy).

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn't useful except for parlay betting. In week #9, just predicting the winning team, Zoltar went only 6-7, which is terrible. Vegas was much better, going 9-4 at just predicting the winning team.

Zoltar sometimes predicts a 0-point margin of victory. There is one such game in week #10 (Falcons vs. Panthers). In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.



My prediction system is math-based in the sense that it computes a numeric rating for each team and then uses ratings to compute the predicted margin of victory of the better team. An entirely different approach for predicting NFL football scores is to use a sophisticated simulation program and then simulate a game many thousands of times.

Left: There are many versions of simple football simulation games where a player rolls two ordinary dice and the result is based on the 21 (order doesn’t matter) or 36 (order matters) possible outcomes. This one is called simply “Football Board Game” by The Blue Crab company of Sunrise, Florida.

Center: “1st and Goal” by R and R Games is a fairly sophisticated game that uses several kinds of specialized dice and cards.

Right: “Half-Time Football” by Lakeside was produced in the late 1970s and early 80s. It’s strictly a dice game and is medium complexity.

Posted in Zoltar | Leave a comment

PyTorch Model Training Using a Program-Defined Function vs. Inline Code

In most cases, I train a PyTorch model using inline code inside a main() function. For example:

import torch as T
. . .
def main():
  # 0. get started
  print("Begin People predict politics type ")
  T.manual_seed(1)
  np.random.seed(1)
  
  # 1. create Dataset and DataLoader objects
  print("Creating People Datasets ")
  train_file = ".\\Data\\people_train.txt"
  train_ds = PeopleDataset(train_file)
  bat_size = 10
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  # 2. create network
  print("Creating 6-(10-10)-3 tanh network ")
  net = Net().to(device)
  net.train()  # set mode

  # 3. train
  max_epochs = 1000
  ep_log_interval = 100
  lrn_rate = 0.01
  loss_func = T.nn.NLLLoss()  # assumes log_softmax()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)

  print("Starting training")
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # inputs
      Y = batch[1]  # correct class/label/politics
      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()
      optimizer.step()

    if epoch % ep_log_interval == 0:
      print("epoch = %5d  |  loss = %10.4f" % \
        (epoch, epoch_loss))
  print("Training done ")
  . . .

An alternative approach is to use a program-defined training function like so:

. . .
def train(net, ds, opt, lr, bs, me, le):
  train_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)  
  loss_func = T.nn.NLLLoss()  # assumes log_softmax()
  if opt == 'sgd':
    optimizer = T.optim.SGD(net.parameters(), lr=lr)
  elif opt == 'adam':
    optimizer = T.optim.Adam(net.parameters(), lr=lr)
  # else error

  for epoch in range(0, me):
    epoch_loss = 0.0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # inputs
      Y = batch[1]  # correct class/label/politics
      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()
      optimizer.step()

    if epoch % le == 0:
      print("epoch = %5d  |  loss = %10.4f" % \
        (epoch, epoch_loss))
  print("Training done ") 
  return net 

def main():
  # 0. get started
  print("Begin People predict politics type ")
  T.manual_seed(1)
  np.random.seed(1)
  
  # 1. create Dataset objects
  print("Creating People Datasets ")
  train_file = ".\\Data\\people_train.txt"
  train_ds = PeopleDataset(train_file)  # 200 rows

# -----------------------------------------------------------

  # 2. create network
  print("Creating 6-(10-10)-3 tanh network ")
  net = Net().to(device)
  net.train()

# -----------------------------------------------------------

  # 3. train
  opt = 'sgd'  # optimizer algorithm
  lr = 0.01    # learning rate
  bs = 10      # batch size
  me = 1000    # max training epochs
  le = 200     # monitor loss every n epochs

  print("Starting training ")
  train(net, train_ds, opt, lr, bs, me, le)
  print("Done ")
. . .

Using a program-defined train() function is useful when doing hyperparameter optimization because the training code will be called many times. But for regular scenarios, defining a train() function isn't as clear as using a simple inline approach.
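For example, once the train() function exists, a crude hyperparameter sweep is just a pair of loops. The candidate values below are made up for illustration:

for lr in [0.005, 0.01, 0.05]:     # candidate learning rates (hypothetical)
  for bs in [8, 10, 16]:           # candidate batch sizes (hypothetical)
    net = Net().to(device)         # fresh network for each trial
    net.train()
    train(net, train_ds, 'sgd', lr, bs, me=1000, le=200)
    # evaluate the trained net and log (lr, bs, metric) here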

The moral of the story is that when writing code, there are always multiple design options where experience and intuition guide the design.



Minicomputers dominated the 1970s. They were a response to the very expensive mainframes of the 1960s. Minicomputer designs varied wildly, but most were 16-bit machines with processors that ran at about 12 MHz and could support 16 concurrent users. With peripherals, a minicomputer system cost about $100,000 (roughly $800,000 today), which made them the hardware of choice for colleges and universities of the 1970s. The PCs of the 1980s killed minicomputers.

Left: The Hewlett-Packard HP-2100. Center: The Digital Equipment Corporation PDP-11. My first programming was done on a PDP-11 using the RSTS operating system, when I was a student at U.C. Irvine. Right: The Control Data Corporation CDC-1700.


Posted in PyTorch | 1 Comment

An Example of PyTorch Hyperparameter Random Search

Bottom line: Hyperparameter random search can be effective but the difficult part is determining what to parameterize and the range of possible parameter values.

When creating a neural network prediction model there are many architecture hyperparameters (number of hidden layers, number of nodes in each hidden layer, hidden activation functions, initialization algorithms and their parameters, and so on). And then there are dozens of training hyperparameters (optimization algorithm, learning rate, momentum, batch size, number of training epochs, and so on).


In this demo, the best parameters were found in trial 2: 16 hidden nodes, tanh hidden activation, Adam optimization, learning rate = 0.01809, batch size = 14, and max_epochs = 799.

Most of my colleagues and I use a manual approach for finding good hyperparameters, relying on experience and intuition. It's also possible to programmatically search for good hyperparameters. Somewhat surprisingly, a random search of hyperparameter values is highly effective compared to more systematic techniques, grid search in particular. See "Random Search for Hyper-Parameter Optimization" (2012) by J. Bergstra and Y. Bengio.

I put together a demo of hyperparameter random search. My demo problem is to predict a person’s political leaning (conservative, moderate, liberal) from sex, age, state, and income. In pseudo-code the idea is:

  # loop n times
  #   create random arch and train hyperparams
  #   use arch params to create net
  #   use train params to train net
  #   evaluate trained net
  #   log params and eval metric to file
  # end-loop
  # analyze log offline

I used just two architecture parameters: number of hidden nodes and hidden activation function. The architecture was fixed at two hidden layers.

I used just four training parameters: optimization algorithm, learning rate, batch size, and max epochs. Here’s my demo function that generates random hyperparameters:

def create_params(seed=1):
  # n_hid, activation; opt, lr, bs, max_ep
  rnd = np.random.RandomState(seed)

  n_hid = rnd.randint(6, 21)  # [6, 20]
  activation = ['tanh', 'relu'][rnd.randint(0,2)]

  opt = ['sgd', 'adam'][rnd.randint(0,2)]
  lr = rnd.uniform(low=0.001, high=0.10)
  bs = rnd.randint(6, 16)
  max_ep = rnd.randint(200, 1000)

  return (n_hid, activation, opt, lr, bs, max_ep)

The number of hidden nodes varies from 6 to 20, the learning rate varies from 0.001 to 0.10, and so on. Where do these ranges come from? Just guesses based on experience.

There are dozens of details, such as how to evaluate a trained network.

So, hyperparameter search isn’t a magic wand — you have to use experience to determine which of the hundreds of possible parameters to search, and which of the literally infinite ranges for parameter values to use.

One of the disadvantages of random search is that you can get ugly results, such as a learning rate of 0.10243568790223344556677123. One way to deal with this issue is to round floating point values to three decimals and integers to a power of 10 before trying them.
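For example, inside create_params() the raw values could be cleaned up like this. This is just a sketch, and the specific rounding choices (three decimals, flooring to a multiple of 10) are my interpretation, not part of the demo code shown below:

lr = rnd.uniform(low=0.001, high=0.10)
lr = round(lr, 3)               # e.g. 0.0180956... becomes 0.018
max_ep = rnd.randint(200, 1000)
max_ep = (max_ep // 10) * 10    # e.g. 799 becomes 790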



Like many of the older guys I work with, I gained a love of reading from the juvenile "Hardy Boys" mystery series. Several of the books featured a search for something, such as a treasure of some kind. Left: "The Tower Treasure" (#1, 1959 edition). Center: "Hunting for Hidden Gold" (#5, 1963 edition). Right: "The Secret of Pirates' Hill" (#36, 1956 edition). All three covers are by artist Rudy Nappi (1923-2015).


Demo code.

# people_hyperparam_search.py
# predict politics type from sex, age, state, income
# PyTorch 1.12.1-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11 

import numpy as np
import torch as T
device = T.device('cpu')  # apply to Tensor or Module

# -----------------------------------------------------------

class PeopleDataset(T.utils.data.Dataset):
  # sex  age    state    income   politics
  # -1   0.27   0  1  0   0.7610   2
  # +1   0.19   0  0  1   0.6550   0
  # sex: -1 = male, +1 = female
  # state: michigan, nebraska, oklahoma
  # politics: conservative, moderate, liberal

  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(0,7),
      delimiter="\t", comments="#", dtype=np.float32)
    tmp_x = all_xy[:,0:6]   # cols [0,6) = [0,5]
    tmp_y = all_xy[:,6]     # 1-D

    self.x_data = T.tensor(tmp_x, 
      dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y,
      dtype=T.int64).to(device)  # 1-D

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    trgts = self.y_data[idx] 
    return preds, trgts  # as a Tuple

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self, n_hid, activation):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(6, n_hid)  # 6-(nh-nh)-3
    self.hid2 = T.nn.Linear(n_hid, n_hid)
    self.oupt = T.nn.Linear(n_hid, 3)

    if activation == 'tanh':
      self.activ = T.nn.Tanh()
    elif activation == 'relu':
      self.activ = T.nn.ReLU()
    
    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = self.activ(self.hid1(x))
    z = self.activ(self.hid2(z))
    z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss() 
    return z

# -----------------------------------------------------------
 
def overall_loss_3(model, ds, n_class):
  # MSE using built in MSELoss() version
  X = ds[0:len(ds)][0]  # all X values
  Y = ds[0:len(ds)][1]  # all targets, ordinal form
  YY = T.nn.functional.one_hot(Y, num_classes=n_class) 
  
  with T.no_grad():
    oupt = T.exp(model(X))  #  [all,3]  probs form

  loss_func = T.nn.MSELoss(reduction='sum')
  loss_val = loss_func(oupt, YY)  # a tensor
  mse = loss_val / len(ds)
  return mse  # as tensor

# -----------------------------------------------------------

def accuracy(model, dataset):
  # assumes model.eval()
  X = dataset[0:len(dataset)][0]
  # Y = T.flatten(dataset[0:len(dataset)][1])
  Y = dataset[0:len(dataset)][1]
  with T.no_grad():
    oupt = model(X)  #  [40,3]  logits

  # (_, arg_maxs) = T.max(oupt, dim=1)
  arg_maxs = T.argmax(oupt, dim=1)  # argmax() is new
  num_correct = T.sum(Y==arg_maxs)
  acc = (num_correct * 1.0 / len(dataset))
  return acc.item()

# -----------------------------------------------------------

def train(net, ds, opt, lr, bs, me, le):
  train_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)  

  loss_func = T.nn.NLLLoss()  # assumes log_softmax()
  if opt == 'sgd':
    optimizer = T.optim.SGD(net.parameters(), lr=lr)
  elif opt == 'adam':
    optimizer = T.optim.Adam(net.parameters(), lr=lr)
  # else error

  for epoch in range(0, me):
    epoch_loss = 0.0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # inputs
      Y = batch[1]  # correct class/label/politics
      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()
      optimizer.step()

    if epoch % le == 0:
      print("epoch = %5d  |  loss = %10.4f" % \
        (epoch, epoch_loss))
  print("Training done ") 
  return net 

# -----------------------------------------------------------

def create_params(seed=1):
  # n_hid, activation; opt, lr, bs, max_ep
  rnd = np.random.RandomState(seed)

  n_hid = rnd.randint(6, 21)  # [6, 20]
  activation = ['tanh', 'relu'][rnd.randint(0,2)]

  opt = ['sgd', 'adam'][rnd.randint(0,2)]
  lr = rnd.uniform(low=0.001, high=0.10)
  bs = rnd.randint(6, 16)
  max_ep = rnd.randint(200, 1000)

  return (n_hid, activation, opt, lr, bs, max_ep)
  
# -----------------------------------------------------------

def search_params(ds):
  # using Dataset ds
  # loop n times
  #  create random arch and train hyperparams
  #  use arch params to create net
  #  use train params to train net
  #  evaluate trained net
  #  log params and eval metric to file
  # end-loop
  # analyze log offline

  max_trials = 6
  for i in range(max_trials):
    print("\nSearch trial " + str(i))
    (n_hid, activation, opt, lr, bs, max_ep) = \
      create_params(seed=i*2)
    print((n_hid, activation, opt, lr, bs, max_ep))

    net = Net(n_hid, activation).to(device)
    net.train()

    net = train(net, ds, opt, lr, bs, max_ep, le=200)

    net.eval()
    error = overall_loss_3(net, ds, n_class=3).item()
    acc = accuracy(net, ds) 
    print("acc = %0.4f  error = %0.4f " % (acc, error))
    # log params, error, accuracy here

  return 0

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin People hyperparameter random search ")
  T.manual_seed(1)
  np.random.seed(1)
  
  # 1. create Dataset objects
  print("\nCreating People Datasets ")

  train_file = ".\\Data\\people_train.txt"
  train_ds = PeopleDataset(train_file)  # 200 rows

  test_file = ".\\Data\\people_test.txt"
  test_ds = PeopleDataset(test_file)    # 40 rows

  # 2. search for good hyperparameters
  search_params(train_ds)

# -----------------------------------------------------------

  print("\nEnd People hyperparameter random search ")

if __name__ == "__main__":
  main()
Posted in PyTorch | Leave a comment

Yet Another Implementation of a softargmax() Function for PyTorch

Suppose you have a vector/tensor t = [5.5, 2.2, 7.7, 9.9, 8.8]. The regular argmax(t) gives the index of the largest value in the tensor. Because the largest value is 9.9 at [3], argmax(t) = 3.

In some rare scenarios, it’s useful to have a variation of argmax() that is differentiable, usually called softargmax(), not to be confused with regular softmax(). For the example above softargmax(t) = 3.000000 — the result is a float rather than integer.

Here’s one possible implementation of a simple softargmax() for PyTorch:

# softargmax_demo.py

import torch

def softargmax(x):
  # crude: assumes max value is unique
  beta = 100.0
  xx = beta  * x
  sm = torch.nn.functional.softmax(xx, dim=-1)
  indices = torch.arange(len(x))
  y = torch.mul(indices, sm)
  result = torch.sum(y)
  return result
  
print("\nBegin PyTorch softargmax demo ")

t = torch.tensor([5.5, 2.2, 7.7, 9.9, 8.8],
  dtype=torch.float32)
print("\nSource tensor: ")
print(t)

sam = softargmax(t)
print("\nValue of softargmax(): ")
print(sam)

print("\nEnd ")

This simple implementation doesn’t work if there are two or more instances of the largest value. This “feature” is also true of the built-in PyTorch softmax() function.

The input is [5.5, 2.2, 7.7, 9.9, 8.8]. It is multiplied by an arbitrary large value beta = 100, giving [550., 220., 770., 990., 880.]. When you take the regular softmax you get [0., 0., 0., 1., 0.] — the 0 values are not exactly 0 but very close. Likewise there'll be one value that's almost 1 but not quite. The indices are [0, 1, 2, 3, 4]. When you multiply the indices by the softmax values, you get [0., 0., 0., 3., 0.]. Then the sum is very close to 3.0.

Somewhat unfortunately, because the result of softargmax() is a float, not an integer, you can’t use the result as an index into the source vector/tensor. You can cast the softargmax() result to type integer, but that operation is not differentiable which defeats the purpose of softargmax() in many scenarios.
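If the real goal is the value at the argmax position rather than the index itself, one common workaround (not part of the demo above) is to skip indexing and compute a softmax-weighted sum of the values directly:

def soft_select(x, beta=100.0):
  # differentiable approximation of x[argmax(x)]
  # returns a weighted sum of the values, not an index
  w = torch.nn.functional.softmax(beta * x, dim=-1)
  return torch.sum(w * x)

# for t = [5.5, 2.2, 7.7, 9.9, 8.8], soft_select(t) is approximately 9.9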



The softargmax() function is kind of an approximation to argmax(). In science fiction movies, robots are kind of approximations to humans. Here are four movie posters where a robot is holding a woman for some reason.

In “Tobor the Great” (1954), the robot was designed to be sent into space. In “Forbidden Planet” (1956), Robbie the robot was built by the human scientist Morbius using alien Krell technology. In “The Day the Earth Stood Still” (1951), Gort is a kind of bodyguard for good alien Klaatu. In “The Colossus of New York” (1958), a scientist transfers the brain of his dead son into a large robot, with bad consequences.


Posted in PyTorch | 1 Comment

Hyperparameter Tuning Using Evolutionary Optimization

All neural systems have hyperparameter values that must be determined by trial and error. Architecture hyperparameters include things like the number of hidden layers and the hidden node activation. Training hyperparameters include things like learning rate and batch size. The system random number seed is a kind of combined architecture-training hyperparameter because the seed determines initial weight values and the order of processing during training.

In many scenarios, manually tuning hyperparameter values is feasible. But for large, complex neural systems, it's sometimes necessary to programmatically find good hyperparameter values. Random search is simple but is inefficient. In grid search, you set up about 4 to 8 possible values for each hyperparameter, for example learn_rates = [0.001, 0.01, 0.05, 0.10] and bat_sizes = [4, 8, 16, 32], and then try all possible combinations. However, in some scenarios there are just too many combinations. If you have 10 hyperparameters and each can be one of 8 possible values, then there are 8 * 8 * . . . * 8 (ten terms) = 8^10 = 1,073,741,824 combinations of values.

For serious neural systems, I often use evolutionary optimization to search through all possible combinations of hyperparameter values. In very high-level pseudo-code:

create a population of solutions
loop max_generations times
  pick two parent solutions
  make child solution (crossover)
  mutate child
  use child to create, train, eval network
  replace weak solution with child
end-loop
return best solution found

I put together a demo. I set up a synthetic scenario with 10 hypothetical hyperparameters (random seed, number hidden nodes, batch size, etc.) where each hyperparameter has 8 possible values. Each possible hyperparameter value is an index from 0 to 7 into an array of actual values. I created a dummy error function that is optimal for a solution of [0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

Evolutionary optimization is a meta-heuristic with a huge number of concrete implementations — every step in the pseudo-code above can be implemented in dozens of different ways.

For my demo, I used a population size of 6 possible solutions. To pick two parents, I pick one randomly from the best 3 solutions and the other randomly from the worst 3. To make a child solution, I use a fixed crossover point at the middle of the parent solutions, at index [5], and take the left half of parent1 and the right half of parent2. For example:

parent1 = [2 4 6 8 0 1 3 5 7 7]
parent2 = [3 3 3 3 3 6 6 6 6 6]

child   = [2 4 6 8 0 6 6 6 6 6]

To mutate, I select a random hyperparameter and then randomly add 1 or subtract 1. To replace a weak solution with the child, I randomly select one of the 3 worst solutions in the population.

The demo code found a very good solution of [1 0 1 1 0 0 0 0 2 0] in just 60 iterations/generations of the evolutionary algorithm.

My demo program uses a nested class definition, which is a rare technique for me. The outer primary class is Tuner. It has a nested Individual class that holds a solution (array of values) and its corresponding error. I place these together so that an array of Individual objects (the population) can be sorted by error. I could have used parallel arrays but the nested class seemed to make sense to me.

Good fun.



The evolution of science fiction art has always interested me. Artist Paul Lehr (1930-1998) did many book and magazine covers in the 1960s and 1970s. Lehr had a style that bridged the early pulp styles of the 1930s and the photo-based styles of the 1980s. I haven't read any of these three novels. Maybe someday.


Demo code.

# evolutionary_hyperparameter.py

import numpy as np

# n_hp = 10  10 hyperparameters
# n_hpv = 8  each hyperparam has 8 possible values
# n_pop = 6  population size


# -----------------------------------------------------------

class Tuner():

  # ---------------------------------------------------------
  class Individual():
    def __init__(self, n_hp, n_hpv, seed):
      # like [7,0,3,6,2,5,0,0,1,4] - 10 hyperparams in [0,7]
      self.rnd = np.random.RandomState(seed)
      self.soln = np.zeros(n_hp, dtype=np.int64)
      for i in range(n_hp):
        self.soln[i] = self.rnd.randint(0,n_hpv)
      self.err = 0.0

    def __lt__(self, other):  # sorting
      return self.err < other.err

    def __eq__(self, other):
      return self.err == other.err

    def show(self):
      print(self.soln, end="")
      print(" | ", end="")
      print(self.err)
  # ---------------------------------------------------------

  def __init__(self, n_hp, n_hpv, n_pop, seed):
    self.rnd = np.random.RandomState(seed)
    self.n_hp = n_hp
    self.n_hpv = n_hpv
    self.n_pop = n_pop

    self.pop = np.empty((n_pop), dtype=object)
    for i in range(n_pop):
      self.pop[i] = self.Individual(self.n_hp, self.n_hpv, i)
      self.pop[i].err = self.compute_error(self.pop[i].soln)
    self.sort()

    # best found at any point in time
    self.best_soln = np.copy(self.pop[0].soln)
    self.best_err = self.pop[0].err

  def sort(self):
    self.pop = sorted(self.pop)  # Individual sort

  def show(self):
    for i in range(self.n_pop):
      self.pop[i].show()  # calls Individual.show()
    print("-----")
    print(self.best_soln, end="")
    print(" | ", end="")
    print(self.best_err)
 
  def compute_error(self, soln): 
    # in non-demo, soln would be used to create 
    # and train a neural system
    err = 0.0
    for i in range(self.n_hp):  # each hyperparam
      err += soln[i]            # small val is low error
    return err 

  def pick_parents(self):
    # pick indices of two solns
    first = self.rnd.randint(0, self.n_pop // 2)  # good
    second = self.rnd.randint(self.n_pop // 2, self.n_pop) 
    while second == first:
      second = self.rnd.randint(self.n_pop // 2, self.n_pop)
    flip = self.rnd.randint(2)  # 0 or 1
    if flip == 0:
      return (first, second)
    else:
      return (second, first)

  def crossover(self, i, j):
    child_soln = np.zeros(self.n_hp, dtype=np.int64)
    for k in range(0, self.n_hp//2):  # left half
      child_soln[k] = self.pop[i].soln[k]
    for k in range(self.n_hp//2, self.n_hp):  # right half
      child_soln[k] = self.pop[j].soln[k]
    return child_soln

  def mutate(self, soln):
    # a soln is an array with 10 cells, each in [0,7]
    idx = self.rnd.randint(0, self.n_hp)
    flip = self.rnd.randint(2)  # 0 or 1
    if flip == 0:
      soln[idx] -= 1
      if soln[idx] == -1:
        soln[idx] = self.n_hpv-1  # largest
    else:
      soln[idx] += 1
      if soln[idx] == self.n_hpv: # too much
        soln[idx] = 0

  def search(self, max_gen):
    for gen in range(max_gen):
      # 1. make a child soln using crossover
      (i, j) = self.pick_parents()
      child_soln = self.crossover(i, j)

      # 2. mutate child
      self.mutate(child_soln)
      child_err = self.compute_error(child_soln)

      # 2b. if new child has already been evaluated, 
      #     then continue
      
      # 3. is child a new best soln?
      if child_err < self.best_err:
        print("New best soln found at gen " + str(gen))
        self.best_soln = np.copy(child_soln)
        self.best_err = child_err

      # 4. replace a weak soln with child
      idx = self.rnd.randint(self.n_pop // 2, self.n_pop)
      # print("replace idx = " + str(idx))
      for k in range(self.n_hp):
        self.pop[idx].soln[k] = child_soln[k]
      self.pop[idx].err = child_err

      # 5. sort solns from best to worst
      self.sort()
  
# -----------------------------------------------------------

# create a population of solns
# loop
#   pick two parent solns
#   make child
#   mutate child
#   evaluate child
#   replace weak soln with child
# end-loop

def main():
  print("\nBegin evolutionary hyperparameter tune/search ")
  print("Setting 10 hyperparams, each with 8 possible values ")
  print("Optimal param values all 0 ")

  n_hp = 10  # 10 hyperparameters like lrn_rate, etc.
  n_hpv = 8  # each hyperparam has 8 possible values
  n_pop = 6  # population size
  tuner = Tuner(n_hp, n_hpv, n_pop, seed=99)
  
  print("\nInitial population: ")
  tuner.show()

  print("\nBegin hyperparameter search \n")
  tuner.search(60)
  print("\nDone ")

  print("\nFinal population: ")
  tuner.show()

  print("\nEnd ")

if __name__ == "__main__":
  main()

Here's a different version that uses class-level (static) variables g_rnd, g_n_hp, and g_n_hpv so that the nested Individual class can access them.

# evolutionary_hyperparameter_2.py

import numpy as np

# n_hp = 10  10 hyperparameters
# n_hpv = 8  each hyperparam has 8 possible values
# n_pop = 6  population size


# -----------------------------------------------------------

class Tuner():

  # static so that nested Individual can access
  g_rnd = None
  g_n_hp = 0
  g_n_hpv = 0
  
  def __init__(self, n_hp, n_hpv, n_pop, seed):
    Tuner.g_n_hp = n_hp
    Tuner.g_n_hpv = n_hpv
    Tuner.g_rnd = np.random.RandomState(seed)

    self.n_pop = n_pop
    self.pop = np.empty((n_pop), dtype=object)

    for i in range(n_pop):
      self.pop[i] = self.Individual()
      self.pop[i].err = self.compute_error(self.pop[i].soln)
    self.sort()

    # best found at any point in time
    self.best_soln = np.copy(self.pop[0].soln)
    self.best_err = self.pop[0].err

  def sort(self):
    self.pop = sorted(self.pop)  # Individual sort

  def show(self):
    for i in range(self.n_pop):
      self.pop[i].display()  # calls Individual.display()
    print("-----")
    print(self.best_soln, end="")
    print(" | ", end="")
    print(self.best_err)
 
  def compute_error(self, soln): 
    # in non-demo, soln would be used to create 
    # and train a neural system
    err = 0.0
    for i in range(len(soln)):  # each hyperparam
      err += soln[i]            # small val is low error
    return err 

  def pick_parents(self):
    # pick indices of two solns
    first = Tuner.g_rnd.randint(0, self.n_pop // 2)  # good
    second = Tuner.g_rnd.randint(self.n_pop // 2, self.n_pop)
    while second == first:
      second = Tuner.g_rnd.randint(self.n_pop // 2, self.n_pop)
    flip = Tuner.g_rnd.randint(2)  # 0 or 1
    if flip == 0:
      return (first, second)
    else:
      return (second, first)

  def crossover(self, i, j):
    child_soln = np.zeros(Tuner.g_n_hp, dtype=np.int64)
    for k in range(0, Tuner.g_n_hp//2):  # left half
      child_soln[k] = self.pop[i].soln[k]
    for k in range(Tuner.g_n_hp//2, Tuner.g_n_hp):  # right half
      child_soln[k] = self.pop[j].soln[k]
    return child_soln

  def mutate(self, soln):
    # a soln is an array with 10 cells, each in [0,7]
    idx = Tuner.g_rnd.randint(0, Tuner.g_n_hp)
    flip = Tuner.g_rnd.randint(2)  # 0 or 1
    if flip == 0:
      soln[idx] -= 1
      if soln[idx] == -1:
        soln[idx] = Tuner.g_n_hpv-1  # largest
    else:
      soln[idx] += 1
      if soln[idx] == Tuner.g_n_hpv: # too much
        soln[idx] = 0

  def search(self, max_gen):
    for gen in range(max_gen):
      # 1. make a child soln using crossover
      (i, j) = self.pick_parents()
      child_soln = self.crossover(i, j)

      # 2. mutate child
      self.mutate(child_soln)
      child_err = self.compute_error(child_soln)

      # 2b. if new child has already been evaluated, 
      #     then continue
      
      # 3. is child a new best soln?
      if child_err < self.best_err:
        print("New best soln found at gen " + str(gen))
        self.best_soln = np.copy(child_soln)
        self.best_err = child_err

      # 4. replace a weak soln with child
      idx = Tuner.g_rnd.randint(self.n_pop // 2, self.n_pop)
      # print("replace idx = " + str(idx))
      for k in range(Tuner.g_n_hp):
        self.pop[idx].soln[k] = child_soln[k]
      self.pop[idx].err = child_err

      # 5. sort solns from best to worst
      self.sort()

  # ---------------------------------------------------------
  class Individual():
    def __init__(self):
      # like [7,0,3,6,2,5,0,0,1,4] - 10 hyperparams in [0,7]
      self.soln = np.zeros(Tuner.g_n_hp, dtype=np.int64)
      for i in range(len(self.soln)):
        self.soln[i] = Tuner.g_rnd.randint(0,Tuner.g_n_hpv)
      self.err = 0.0

    def __lt__(self, other):  # sorting
      return self.err < other.err

    def __eq__(self, other):
      return self.err == other.err

    def display(self):
      print(self.soln, end="")
      print(" | ", end="")
      print(self.err)
  # ---------------------------------------------------------
  
# -----------------------------------------------------------

def main():
  print("\nBegin evolutionary hyperparameter tune/search ")
  print("Setting 10 hyperparams, each with 8 possible values ")
  print("Optimal param values all 0 ")

  n_hp = 10  # 10 hyperparameters like lrn_rate, etc.
  n_hpv = 8  # each hyperparam has 8 possible values
  n_pop = 6  # population size
  tuner = Tuner(n_hp, n_hpv, n_pop, seed=0)
  
  print("\nInitial population: ")
  tuner.show()

  print("\nBegin hyperparameter search \n")
  tuner.search(60)
  print("\nDone ")

  print("\nFinal population: ")
  tuner.show()

  print("\nEnd ")

if __name__ == "__main__":
  main()
Posted in Machine Learning | Leave a comment

NFL 2022 Week 9 Predictions – Zoltar Likes the Underdog Bears

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #9 of the 2022 season.

Zoltar:      eagles  by  OFF  dog =      texans    Vegas:      eagles  by   14
Zoltar:     falcons  by    1  dog =    chargers    Vegas:    chargers  by    3
Zoltar:    dolphins  by    0  dog =       bears    Vegas:    dolphins  by    5
Zoltar:     bengals  by    6  dog =    panthers    Vegas:     bengals  by    7
Zoltar:     packers  by    4  dog =       lions    Vegas:     packers  by  3.5
Zoltar:     raiders  by    2  dog =     jaguars    Vegas:     raiders  by  1.5
Zoltar:    patriots  by    4  dog =       colts    Vegas:    patriots  by  5.5
Zoltar:       bills  by  OFF  dog =        jets    Vegas:       bills  by   13
Zoltar:  commanders  by    2  dog =     vikings    Vegas:     vikings  by  3.5
Zoltar:   cardinals  by    6  dog =    seahawks    Vegas:   cardinals  by    2
Zoltar:  buccaneers  by    4  dog =        rams    Vegas:  buccaneers  by    3
Zoltar:      chiefs  by  OFF  dog =      titans    Vegas:      chiefs  by 12.5
Zoltar:      saints  by    4  dog =      ravens    Vegas:      ravens  by    3

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I sometimes use 3.0 points difference but for the first few weeks of the season I am a bit more conservative and use 4.0 points difference as the advice threshold criterion.

At the beginning of the season, because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is very strongly biased towards Vegas underdogs. I probably need to fix this. For week #9 Zoltar likes three Vegas underdogs:

1. Zoltar likes Vegas underdog Bears against the Dolphins.
2. Zoltar likes Vegas underdog Commanders against the Vikings.
3. Zoltar likes Vegas underdog Saints against the Ravens.

For example, a bet on the underdog Bears against the Dolphins will pay off if the Bears win by any score, or if the favored Dolphins win but by less than 5.0 points. If a favored team wins by exactly the point spread, the wager is a push. This is why point spreads often have a 0.5 added — called “the hook” — to eliminate pushes.

According to Zoltar, there are three big mismatches that are beyond his programming to deal with: Eagles vs. Texans, Bills vs. Jets, Chiefs vs. Titans. We’ll see.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

In week #8, against the Vegas point spread, Zoltar went only 2-3 (using 4.0 points as the advice threshold).

For the season, against the spread, Zoltar is 28-15 (~65% accuracy).

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn't useful except for parlay betting. In week #8, just predicting the winning team, Zoltar went 10-5, which is fair. Vegas was one game better, going 11-4 at just predicting the winning team.

Zoltar sometimes predicts a 0-point margin of victory. There is one such game in week #9. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.



Left: My system is named after the Zoltar fortune teller machine you can find in arcades.

Center: The Oracle fortune teller from about 1925 was produced by the Exhibit Supply Company (ESCO) of Chicago. ESCO was a dominant manufacturer of coin operated machines from about 1910 through the early 1950s.

Right: Uncle Sam’s Philosophy machine from the late 1800s was produced by the Automatic Amusement Co. of London. The spinner pointed to one of 18 rhyming predictions such as, “To-morrow going down the street, an ancient lover you will meet.” The machine reads “Consultation Fee 1D” which confused me until I discovered that D, from “denarius” (Latin for coin) was the old UK abbreviation for penny.


Posted in Zoltar | Leave a comment

“Binary Classification Using New PyTorch Best Practices, Part 2: Training, Accuracy, Predictions” in Visual Studio Magazine.

I wrote an article titled “Binary Classification Using New PyTorch Best Practices, Part 2: Training, Accuracy, Predictions” in the October 2022 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2022/10/14/binary-classification-using-pytorch-2.aspx.

The goal of a binary classification system is to predict the value of a variable that can only be one of two possibilities. The article is the second part of a two-part series. The articles explain a demo program where the goal is to predict a person’s sex (male = 0 or female = 1) from age, state (Michigan, Nebraska, Oklahoma), annual income, and political leaning (conservative, moderate, liberal).

I've written about binary classification before. Compared to my previous articles, the new article presents current best practices and adds an evaluation of precision, recall, and F1 score.

The source code and data can be found at: https://jamesmccaffrey.wordpress.com/2022/09/23/binary-classification-using-pytorch-1-12-1-on-windows-10-11/.

I include these metrics because in many binary classification problem scenarios, plain classification accuracy isn’t a good metric. For example, if a dataset has 950 data items that are class 0 and 50 data items that are class 1, then a model that predicts class 0 for any input will score 95 percent accuracy. Precision and recall give alternate metrics that aren’t as sensitive to skewed data. The F1 score is the harmonic average of precision and recall.
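As a quick reminder of the definitions, here is a small sketch using made-up confusion-matrix counts, where class 1 is the class of interest:

tp = 30  # class 1 items predicted as class 1 (hypothetical counts)
fp = 20  # class 0 items predicted as class 1
fn = 20  # class 1 items predicted as class 0

precision = tp / (tp + fp)   # 0.60
recall = tp / (tp + fn)      # 0.60
f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean = 0.60
print("precision = %0.2f  recall = %0.2f  F1 = %0.2f" % (precision, recall, f1))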



“The Dark Crystal” (1982) is one of my favorite fantasy movies. Most of the species are somewhat gender-vague. Left: The kind Gelflings. Center: The evil Skeksis. Right: The rather lame Podlings.


Posted in PyTorch | Leave a comment

Transformer Based Dataset Similarity for MNIST

Computing a measure of the similarity of two datasets is a very difficult problem. Several months ago, I worked on a project and devised an algorithm based on a neural autoencoder. See jamesmccaffrey.wordpress.com/2021/04/02/dataset-similarity-sometimes-ideas-work-better-than-expected/.

I wondered if I could adapt my dataset similarity algorithm to use a neural transformer architecture instead of a neural autoencoder architecture. After a few hours of experimentation, I got a promising demo up and running. I used the UCI Digits dataset for my experiments. Each UCI data item is a crude grayscale image of a handwritten digit from ‘0’ to ‘9’. Each image has 8 by 8 = 64 pixels. Each pixel is a value between 0 (white) and 16 (black). The UCI Digits dataset is basically a small, simplified version of the well-known MNIST dataset. See jamesmccaffrey.wordpress.com/2022/10/03/the-distance-between-two-datasets-using-transformer-encoding/.

So, I then wondered if I could adapt the transformer based dataset similarity experiment from the UCI Digits dataset to the more difficult MNIST dataset. After a bit of work, I got an example running. For my demo, I compared 1000 items from the MNIST training data with 1000 items where 400 of the items had been randomized. The experiment worked as expected, but . . .

I observed that the demo took a long time to run. This is because the transformer attention mechanism runs in O(n^2) time, where n is the number of input items. In my original transformer experiment, the UCI Digits have n=65 inputs (64 pixel values and 1 label). But MNIST has n=785 inputs (784 pixels and 1 label).



The key component of my demo code is:

class AutoencoderTransformer(T.nn.Module):  # 785-xx-4-30-400-785 
  def __init__(self):
    # 785 numeric inputs: no exact word embedding equivalent
    # pseudo embed_dim = 2
    # seq_len = 785
    super(AutoencoderTransformer, self).__init__() 

    self.fc1 = T.nn.Linear(785, 785*2)  # pseudo-embedding
    self.fc2 = T.nn.Linear(785*2, 4)

    self.pos_enc = \
      PositionalEncoding(2, dropout=0.00)  # positional

    self.enc_layer = T.nn.TransformerEncoderLayer(d_model=2,
      nhead=2, dim_feedforward=100, dropout=0.0,
      batch_first=True)  # d_model divisible by nhead

    self.trans_enc = T.nn.TransformerEncoder(self.enc_layer,
      num_layers=6)

    self.dec1 = T.nn.Linear(4, 30) 
    self.dec2 = T.nn.Linear(30, 400)
    self.dec3 = T.nn.Linear(400, 785)

     # use default weight initialization
 
    self.latent_dim = 4

  def encode(self, x):           # x is [bs, 785]
    z = T.tanh(self.fc1(x))      # [bs, 1570]
    z = z.reshape(-1, 785, 2)    # [bs, 785, 2]
    z = self.pos_enc(z)          # [bs, 785, 2]
    z = self.trans_enc(z)        # [bs, 785, 2]
    z = z.reshape(-1, 785*2)     # [bs, 1570]
    z = T.sigmoid(self.fc2(z))   # [bs, 4]
    return z

  def decode(self, x):
    z = T.tanh(self.dec1(x))     # [bs, 30]
    z = T.tanh(self.dec2(z))     # [bs, 400]
    z = T.sigmoid(self.dec3(z))  # [bs, 785]
    return z    

  def forward(self, x):            # x is [bs,785]
    z = self.encode(x)
    oupt = self.decode(z)
    return oupt

The details are quite complicated but briefly, the AutoencoderTransformer accepts as input 785 values (784 pixels and the associated class label), and encodes the input as a vector of four values. This is called the latent representation.

Each of the 1000 items in the reference P dataset is fed to the AutoencoderTransformer, which results in a frequency distribution. Each of the 1000 items in the "other" Q dataset is fed to the AutoencoderTransformer, which results in a second frequency distribution. The two distributions are compared using the Kullback-Leibler divergence, which gives a measure of how similar the P and Q datasets are.
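The final comparison step is conceptually simple. Here is a rough sketch of computing KL divergence from two histograms of latent values; the bin counts and the smoothing constant are my assumptions, not necessarily what the demo uses:

import numpy as np

def kl_divergence(p_counts, q_counts, eps=1.0e-6):
  # p_counts, q_counts: histogram counts over the same latent-value bins
  p = (p_counts + eps) / np.sum(p_counts + eps)  # smooth, then normalize
  q = (q_counts + eps) / np.sum(q_counts + eps)
  return np.sum(p * np.log(p / q))  # KL(P || Q); 0.0 when identical

# example with two made-up 10-bin frequency distributions
p_counts = np.array([120, 95, 80, 110, 100, 105, 90, 100, 95, 105],
  dtype=np.float64)
q_counts = np.array([60, 140, 70, 130, 90, 120, 80, 110, 85, 115],
  dtype=np.float64)
print("KL(P||Q) = %0.4f" % kl_divergence(p_counts, q_counts))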

Good fun (at least for me).



Three covers of “The Master Mind of Mars” (1927), the sixth book in the Mars series by author Edgar R. Burroughs. Left: The 1927 version by artist Frank R. Paul. Center: The 1963 version by artist Roy Krenkel. Right: The 1969 version by artist Robert Abbett. Without using any fancy similarity metrics, I’d say the left and center images are closest.


Posted in PyTorch, Transformers | Leave a comment

Hyperparameter Tuning Using Distributed Grid Descent

I was chit-chatting with one of my work colleagues about an algorithm he created for hyperparameter tuning. The algorithm is called Distributed Grid Descent (DGD).

Every neural prediction system has hyperparameters such as training learning rate, batch size, number of hidden nodes, hidden activation, and so on. Complex systems can have 10-20 hyperparameters.

In regular grid search, you set up candidate values, for example, lrn_rate = [0.001, 0.005, 0.010, 0.015, 0.020, 0.025] and bat_size = [4, 8, 10, 16, 32, 64] and max_epochs = [100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000], and then try all possible combinations. The program random number generator seed value is sort of a hyperparameter too — you should typically try each set of hyperparameters with about 10 different seed values and then average the resulting metric values (often mean squared error).
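In code, plain grid search with seed averaging is just nested loops over all combinations. The train_and_eval() function below is a hypothetical placeholder for building, training, and scoring a network:

import itertools
import numpy as np

lrn_rates = [0.001, 0.005, 0.010, 0.015, 0.020, 0.025]
bat_sizes = [4, 8, 10, 16, 32, 64]
max_epochs = [100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]

results = []
for (lr, bs, me) in itertools.product(lrn_rates, bat_sizes, max_epochs):
  errs = []
  for seed in range(10):  # average over 10 random seeds
    err = train_and_eval(lr, bs, me, seed)  # hypothetical helper
    errs.append(err)
  results.append((lr, bs, me, np.mean(errs)))

results.sort(key=lambda tup: tup[3])  # smallest average error first
print(results[0])

With 6 * 6 * 11 = 396 combinations and 10 seeds each, that's 3,960 training runs, which shows why full grid search quickly becomes impractical.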



The DGD algorithm.


Hyperparameter tuning with candidate values is a type of combinatorial optimization problem.

It’s not always possible to try all possible grid hyperparameter value combinations because some systems have too many hyperparameter combinations and some systems have very long training times.

My colleague’s DGD algorithm searches through hyperparameter values. The DGD algorithm is a form of evolutionary optimization that uses only mutation to generate a new candidate hyperparameter set solution. Some evolutionary optimization algorithms use crossover in addition to mutation.
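As a point of comparison, here is a generic sketch of mutation-only search over indexed hyperparameter values. To be clear, this is only an illustration of the mutation-only idea, not the actual DGD algorithm:

import numpy as np

def mutation_only_search(error_func, n_hp=10, n_hpv=8, max_iter=100, seed=0):
  # generic mutation-only search -- NOT the DGD algorithm
  rnd = np.random.RandomState(seed)
  curr = rnd.randint(0, n_hpv, size=n_hp)   # current solution (indices)
  curr_err = error_func(curr)
  for _ in range(max_iter):
    cand = np.copy(curr)
    idx = rnd.randint(n_hp)
    cand[idx] = (cand[idx] + rnd.choice([-1, 1])) % n_hpv  # mutate one value
    cand_err = error_func(cand)
    if cand_err < curr_err:   # keep the mutation only if it helps
      curr, curr_err = cand, cand_err
  return curr, curr_err

# dummy error: sum of indices, so all-zeros is optimal
soln, err = mutation_only_search(lambda s: float(np.sum(s)))
print(soln, err)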

The DGD algorithm is explained in an appendix to the 2020 research paper “Working Memory Graphs” by R. Loynd et al. from the Proceedings of the 37th International Conference on Machine Learning.

Interesting stuff.



I got my love of combinatorial mathematics from playing poker when I was at Servite High School in Anaheim, California, when I should have been doing my Latin homework. We usually played at the house of my classmate Michael Ventriglia. Three random images from an Internet search for “poker face”.


Posted in Machine Learning | 1 Comment