Regression (People Income) Using PyTorch 1.12 on Windows 10/11

A regression problem is one where the goal is to predict a single numeric value. I decided to run a regression demo on my current PyTorch version (1.12.1-CPU) to make sure there were no breaking changes.

I used one of my standard examples where the goal is to predict a person’s annual income from their sex, age, state, and political leaning. My data looks like:

 1   0.24   1   0   0   0.2950   0   0   1
-1   0.39   0   0   1   0.5120   0   1   0
 1   0.63   0   1   0   0.7580   1   0   0
-1   0.36   1   0   0   0.4450   0   1   0
 1   0.27   0   1   0   0.2860   0   0   1
. . .

The tab-delimited fields are sex (male = -1, female = +1), age (divided by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), income (divided by 100,000), politics (conservative = 100, moderate = 010, liberal = 001). The data is synthetic. There are 200 training items and 40 test items.
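
As a minimal sketch of the encoding just described, the helper below turns one raw record into the nine normalized fields. The function name encode_person and the two lookup tables are hypothetical; they are not part of the demo program, which reads data that has already been encoded.

```python
# Hypothetical helper, not part of the demo: encode one raw record
# into the nine normalized fields described above.
STATES = {"michigan": [1, 0, 0], "nebraska": [0, 1, 0], "oklahoma": [0, 0, 1]}
POLITICS = {"conservative": [1, 0, 0], "moderate": [0, 1, 0], "liberal": [0, 0, 1]}

def encode_person(sex, age, state, income, politics):
    sex_code = -1 if sex == "M" else 1    # male = -1, female = +1
    row = [sex_code, age / 100.0]         # age divided by 100
    row += STATES[state.lower()]          # one-hot state
    row.append(income / 100_000.0)        # income divided by 100,000
    row += POLITICS[politics.lower()]     # one-hot politics
    return row

print(encode_person("F", 24, "Michigan", 29_500, "liberal"))
# [1, 0.24, 1, 0, 0, 0.295, 0, 0, 1]
```

The printed row corresponds to the first data line shown above.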

For my demo, I created an 8-(10-10)-1 neural network with tanh() activation on the hidden nodes. I used explicit weight and bias initialization.

For training, I used Adam optimization with a fixed learning rate of 0.01, and mean squared error.

I implemented a program-defined accuracy() function where a correct income prediction is one that’s within a specified percentage of the true income. After training, using a 10% closeness percentage, my model scored 91.00% accuracy on the training data (182 of 200 correct), and 85.00% accuracy on the test data (34 of 40 correct).

Good fun.



Left: The board game “Careers” was first published in 1955. Players accumulate fame, happiness, and money. An unusual feature is that players start by setting their own victory conditions, such as 20 fame points, 10 happiness points, and 30 money points. The game is fun and interesting. Right: “Catan” was first published in 1995 and is wildly popular. The goal is to create wealth by building settlements and roads. I enjoy playing Catan a lot.


Demo code. Wherever “lt” appears in the listing, replace it with the less-than operator symbol (my lame blog editor chokes on symbols).

# people_income.py
# predict income from sex, age, state, politics
# PyTorch 1.12.1-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11 

import numpy as np
import torch as T

device = T.device('cpu')  # apply to Tensor or Module

# -----------------------------------------------------------

class PeopleDataset(T.utils.data.Dataset):
  def __init__(self, src_file):
    # sex age   state   income   politics
    # -1  0.27  0 1 0   0.7610   0 0 1
    # +1  0.19  0 0 1   0.6550   1 0 0

    # tmp_x = np.loadtxt(src_file, usecols=[0,1,2,3,4,6,7,8],
    #   delimiter="\t", comments="#", dtype=np.float32)
    # tmp_y = np.loadtxt(src_file, usecols=5, delimiter="\t",
    #   comments="#", dtype=np.float32)
    # tmp_y = tmp_y.reshape(-1,1)  # 2D required

    all_xy = np.loadtxt(src_file, usecols=[0,1,2,3,4,5,6,7,8],
      delimiter="\t", comments="#", dtype=np.float32)
    tmp_x = all_xy[:,[0,1,2,3,4,6,7,8]]
    tmp_y = all_xy[:,5].reshape(-1,1)  # 2D required

    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.float32).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    incom = self.y_data[idx] 
    return (preds, incom)  # as a tuple

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(8, 10)  # 8-(10-10)-1
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = self.oupt(z)  # regression: no activation
    return z

# -----------------------------------------------------------

def accuracy(model, ds, pct_close):
  # assumes model.eval()
  # correct within pct of true income
  n_correct = 0; n_wrong = 0

  for i in range(len(ds)):
    X = ds[i][0]   # 1-D, shape [8]
    Y = ds[i][1]   # 1-D, shape [1]
    with T.no_grad():
      oupt = model(X)         # computed income

    if T.abs(oupt - Y) < T.abs(pct_close * Y):
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def accuracy_x(model, ds, pct_close):
  # all-at-once (quick)
  # assumes model.eval()
  X = ds.x_data  # all inputs
  Y = ds.y_data  # all targets
  n_items = len(X)
  with T.no_grad():
    pred = model(X)  # all predicted incomes
 
  n_correct = T.sum((T.abs(pred - Y) < T.abs(pct_close * Y)))
  result = (n_correct.item() / n_items)  # scalar
  return result  

# -----------------------------------------------------------

def train(model, ds, bs, lr, me, le):
  # dataset, bat_size, lrn_rate, max_epochs, log interval
  train_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)
  loss_func = T.nn.MSELoss()
  optimizer = T.optim.Adam(model.parameters(), lr=lr)

  for epoch in range(0, me):
    epoch_loss = 0.0  # for one full epoch

    for (b_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # predictors
      y = batch[1]  # target income
      optimizer.zero_grad()
      oupt = model(X)
      loss_val = loss_func(oupt, y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()  # compute gradients
      optimizer.step()     # update weights

    if epoch % le == 0:
      print("epoch = %4d  |  loss = %0.4f" % (epoch, epoch_loss)) 

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin People predict income ")
  T.manual_seed(0)
  np.random.seed(0)
  
  # 1. create Dataset objects
  print("\nCreating People Dataset objects ")
  train_file = ".\\Data\\people_train.txt"
  train_ds = PeopleDataset(train_file)  # 200 rows

  test_file = ".\\Data\\people_test.txt"
  test_ds = PeopleDataset(test_file)  # 40 rows

  # bat_size = 10
  # train_ldr = T.utils.data.DataLoader(train_ds,
  #   batch_size=bat_size, shuffle=True)

  # 2. create network
  print("\nCreating 8-(10-10)-1 neural network ")
  net = Net().to(device)

# -----------------------------------------------------------

  # 3. train model
  print("\nbat_size = 10 ")
  print("loss = MSELoss() ")
  print("optimizer = Adam ")
  print("lrn_rate = 0.01 ")

  print("\nStarting training")
  net.train()
  train(net, train_ds, bs=10, lr=0.01, me=1000, le=100)
  print("Done ")

# -----------------------------------------------------------

  # 4. evaluate model accuracy
  print("\nComputing model accuracy (within 0.10 of true) ")
  net = net.eval()
  acc_train = accuracy(net, train_ds, 0.10)  # item-by-item
  print("Accuracy on train data = %0.4f" % acc_train)

  acc_test = accuracy_x(net, test_ds, 0.10)  # all-at-once
  print("Accuracy on test data = %0.4f" % acc_test)

# -----------------------------------------------------------

  # 5. make a prediction
  print("\nPredicting income for M 34 Oklahoma moderate: ")
  x = np.array([[-1, 0.34, 0,0,1,  0,1,0]],
    dtype=np.float32)
  x = T.tensor(x, dtype=T.float32).to(device) 

  with T.no_grad():
    pred_inc = net(x)
  pred_inc = pred_inc.item()  # scalar
  print("$%0.2f" % (pred_inc * 100_000))  # un-normalized

# -----------------------------------------------------------

  # 6. save model (state_dict approach)
  print("\nSaving trained model state")
  fn = ".\\Models\\people_income_model.pt"
  T.save(net.state_dict(), fn)

  # model = Net()
  # model.load_state_dict(T.load(fn))
  # use model to make prediction(s)

  print("\nEnd People income demo")

if __name__ == "__main__":
  main()

Training data. Replace comma characters with tabs.

# people_train.txt
#
# sex (-1 = male, 1 = female), age / 100,
# state (michigan = 100, nebraska = 010, oklahoma = 001),
# income / 100_000,
# politics (conservative = 100, moderate = 010, liberal = 001)
#
1,0.24,1,0,0,0.2950,0,0,1
-1,0.39,0,0,1,0.5120,0,1,0
1,0.63,0,1,0,0.7580,1,0,0
-1,0.36,1,0,0,0.4450,0,1,0
1,0.27,0,1,0,0.2860,0,0,1
1,0.50,0,1,0,0.5650,0,1,0
1,0.50,0,0,1,0.5500,0,1,0
-1,0.19,0,0,1,0.3270,1,0,0
1,0.22,0,1,0,0.2770,0,1,0
-1,0.39,0,0,1,0.4710,0,0,1
1,0.34,1,0,0,0.3940,0,1,0
-1,0.22,1,0,0,0.3350,1,0,0
1,0.35,0,0,1,0.3520,0,0,1
-1,0.33,0,1,0,0.4640,0,1,0
1,0.45,0,1,0,0.5410,0,1,0
1,0.42,0,1,0,0.5070,0,1,0
-1,0.33,0,1,0,0.4680,0,1,0
1,0.25,0,0,1,0.3000,0,1,0
-1,0.31,0,1,0,0.4640,1,0,0
1,0.27,1,0,0,0.3250,0,0,1
1,0.48,1,0,0,0.5400,0,1,0
-1,0.64,0,1,0,0.7130,0,0,1
1,0.61,0,1,0,0.7240,1,0,0
1,0.54,0,0,1,0.6100,1,0,0
1,0.29,1,0,0,0.3630,1,0,0
1,0.50,0,0,1,0.5500,0,1,0
1,0.55,0,0,1,0.6250,1,0,0
1,0.40,1,0,0,0.5240,1,0,0
1,0.22,1,0,0,0.2360,0,0,1
1,0.68,0,1,0,0.7840,1,0,0
-1,0.60,1,0,0,0.7170,0,0,1
-1,0.34,0,0,1,0.4650,0,1,0
-1,0.25,0,0,1,0.3710,1,0,0
-1,0.31,0,1,0,0.4890,0,1,0
1,0.43,0,0,1,0.4800,0,1,0
1,0.58,0,1,0,0.6540,0,0,1
-1,0.55,0,1,0,0.6070,0,0,1
-1,0.43,0,1,0,0.5110,0,1,0
-1,0.43,0,0,1,0.5320,0,1,0
-1,0.21,1,0,0,0.3720,1,0,0
1,0.55,0,0,1,0.6460,1,0,0
1,0.64,0,1,0,0.7480,1,0,0
-1,0.41,1,0,0,0.5880,0,1,0
1,0.64,0,0,1,0.7270,1,0,0
-1,0.56,0,0,1,0.6660,0,0,1
1,0.31,0,0,1,0.3600,0,1,0
-1,0.65,0,0,1,0.7010,0,0,1
1,0.55,0,0,1,0.6430,1,0,0
-1,0.25,1,0,0,0.4030,1,0,0
1,0.46,0,0,1,0.5100,0,1,0
-1,0.36,1,0,0,0.5350,1,0,0
1,0.52,0,1,0,0.5810,0,1,0
1,0.61,0,0,1,0.6790,1,0,0
1,0.57,0,0,1,0.6570,1,0,0
-1,0.46,0,1,0,0.5260,0,1,0
-1,0.62,1,0,0,0.6680,0,0,1
1,0.55,0,0,1,0.6270,1,0,0
-1,0.22,0,0,1,0.2770,0,1,0
-1,0.50,1,0,0,0.6290,1,0,0
-1,0.32,0,1,0,0.4180,0,1,0
-1,0.21,0,0,1,0.3560,1,0,0
1,0.44,0,1,0,0.5200,0,1,0
1,0.46,0,1,0,0.5170,0,1,0
1,0.62,0,1,0,0.6970,1,0,0
1,0.57,0,1,0,0.6640,1,0,0
-1,0.67,0,0,1,0.7580,0,0,1
1,0.29,1,0,0,0.3430,0,0,1
1,0.53,1,0,0,0.6010,1,0,0
-1,0.44,1,0,0,0.5480,0,1,0
1,0.46,0,1,0,0.5230,0,1,0
-1,0.20,0,1,0,0.3010,0,1,0
-1,0.38,1,0,0,0.5350,0,1,0
1,0.50,0,1,0,0.5860,0,1,0
1,0.33,0,1,0,0.4250,0,1,0
-1,0.33,0,1,0,0.3930,0,1,0
1,0.26,0,1,0,0.4040,1,0,0
1,0.58,1,0,0,0.7070,1,0,0
1,0.43,0,0,1,0.4800,0,1,0
-1,0.46,1,0,0,0.6440,1,0,0
1,0.60,1,0,0,0.7170,1,0,0
-1,0.42,1,0,0,0.4890,0,1,0
-1,0.56,0,0,1,0.5640,0,0,1
-1,0.62,0,1,0,0.6630,0,0,1
-1,0.50,1,0,0,0.6480,0,1,0
1,0.47,0,0,1,0.5200,0,1,0
-1,0.67,0,1,0,0.8040,0,0,1
-1,0.40,0,0,1,0.5040,0,1,0
1,0.42,0,1,0,0.4840,0,1,0
1,0.64,1,0,0,0.7200,1,0,0
-1,0.47,1,0,0,0.5870,0,0,1
1,0.45,0,1,0,0.5280,0,1,0
-1,0.25,0,0,1,0.4090,1,0,0
1,0.38,1,0,0,0.4840,1,0,0
1,0.55,0,0,1,0.6000,0,1,0
-1,0.44,1,0,0,0.6060,0,1,0
1,0.33,1,0,0,0.4100,0,1,0
1,0.34,0,0,1,0.3900,0,1,0
1,0.27,0,1,0,0.3370,0,0,1
1,0.32,0,1,0,0.4070,0,1,0
1,0.42,0,0,1,0.4700,0,1,0
-1,0.24,0,0,1,0.4030,1,0,0
1,0.42,0,1,0,0.5030,0,1,0
1,0.25,0,0,1,0.2800,0,0,1
1,0.51,0,1,0,0.5800,0,1,0
-1,0.55,0,1,0,0.6350,0,0,1
1,0.44,1,0,0,0.4780,0,0,1
-1,0.18,1,0,0,0.3980,1,0,0
-1,0.67,0,1,0,0.7160,0,0,1
1,0.45,0,0,1,0.5000,0,1,0
1,0.48,1,0,0,0.5580,0,1,0
-1,0.25,0,1,0,0.3900,0,1,0
-1,0.67,1,0,0,0.7830,0,1,0
1,0.37,0,0,1,0.4200,0,1,0
-1,0.32,1,0,0,0.4270,0,1,0
1,0.48,1,0,0,0.5700,0,1,0
-1,0.66,0,0,1,0.7500,0,0,1
1,0.61,1,0,0,0.7000,1,0,0
-1,0.58,0,0,1,0.6890,0,1,0
1,0.19,1,0,0,0.2400,0,0,1
1,0.38,0,0,1,0.4300,0,1,0
-1,0.27,1,0,0,0.3640,0,1,0
1,0.42,1,0,0,0.4800,0,1,0
1,0.60,1,0,0,0.7130,1,0,0
-1,0.27,0,0,1,0.3480,1,0,0
1,0.29,0,1,0,0.3710,1,0,0
-1,0.43,1,0,0,0.5670,0,1,0
1,0.48,1,0,0,0.5670,0,1,0
1,0.27,0,0,1,0.2940,0,0,1
-1,0.44,1,0,0,0.5520,1,0,0
1,0.23,0,1,0,0.2630,0,0,1
-1,0.36,0,1,0,0.5300,0,0,1
1,0.64,0,0,1,0.7250,1,0,0
1,0.29,0,0,1,0.3000,0,0,1
-1,0.33,1,0,0,0.4930,0,1,0
-1,0.66,0,1,0,0.7500,0,0,1
-1,0.21,0,0,1,0.3430,1,0,0
1,0.27,1,0,0,0.3270,0,0,1
1,0.29,1,0,0,0.3180,0,0,1
-1,0.31,1,0,0,0.4860,0,1,0
1,0.36,0,0,1,0.4100,0,1,0
1,0.49,0,1,0,0.5570,0,1,0
-1,0.28,1,0,0,0.3840,1,0,0
-1,0.43,0,0,1,0.5660,0,1,0
-1,0.46,0,1,0,0.5880,0,1,0
1,0.57,1,0,0,0.6980,1,0,0
-1,0.52,0,0,1,0.5940,0,1,0
-1,0.31,0,0,1,0.4350,0,1,0
-1,0.55,1,0,0,0.6200,0,0,1
1,0.50,1,0,0,0.5640,0,1,0
1,0.48,0,1,0,0.5590,0,1,0
-1,0.22,0,0,1,0.3450,1,0,0
1,0.59,0,0,1,0.6670,1,0,0
1,0.34,1,0,0,0.4280,0,0,1
-1,0.64,1,0,0,0.7720,0,0,1
1,0.29,0,0,1,0.3350,0,0,1
-1,0.34,0,1,0,0.4320,0,1,0
-1,0.61,1,0,0,0.7500,0,0,1
1,0.64,0,0,1,0.7110,1,0,0
-1,0.29,1,0,0,0.4130,1,0,0
1,0.63,0,1,0,0.7060,1,0,0
-1,0.29,0,1,0,0.4000,1,0,0
-1,0.51,1,0,0,0.6270,0,1,0
-1,0.24,0,0,1,0.3770,1,0,0
1,0.48,0,1,0,0.5750,0,1,0
1,0.18,1,0,0,0.2740,1,0,0
1,0.18,1,0,0,0.2030,0,0,1
1,0.33,0,1,0,0.3820,0,0,1
-1,0.20,0,0,1,0.3480,1,0,0
1,0.29,0,0,1,0.3300,0,0,1
-1,0.44,0,0,1,0.6300,1,0,0
-1,0.65,0,0,1,0.8180,1,0,0
-1,0.56,1,0,0,0.6370,0,0,1
-1,0.52,0,0,1,0.5840,0,1,0
-1,0.29,0,1,0,0.4860,1,0,0
-1,0.47,0,1,0,0.5890,0,1,0
1,0.68,1,0,0,0.7260,0,0,1
1,0.31,0,0,1,0.3600,0,1,0
1,0.61,0,1,0,0.6250,0,0,1
1,0.19,0,1,0,0.2150,0,0,1
1,0.38,0,0,1,0.4300,0,1,0
-1,0.26,1,0,0,0.4230,1,0,0
1,0.61,0,1,0,0.6740,1,0,0
1,0.40,1,0,0,0.4650,0,1,0
-1,0.49,1,0,0,0.6520,0,1,0
1,0.56,1,0,0,0.6750,1,0,0
-1,0.48,0,1,0,0.6600,0,1,0
1,0.52,1,0,0,0.5630,0,0,1
-1,0.18,1,0,0,0.2980,1,0,0
-1,0.56,0,0,1,0.5930,0,0,1
-1,0.52,0,1,0,0.6440,0,1,0
-1,0.18,0,1,0,0.2860,0,1,0
-1,0.58,1,0,0,0.6620,0,0,1
-1,0.39,0,1,0,0.5510,0,1,0
-1,0.46,1,0,0,0.6290,0,1,0
-1,0.40,0,1,0,0.4620,0,1,0
-1,0.60,1,0,0,0.7270,0,0,1
1,0.36,0,1,0,0.4070,0,0,1
1,0.44,1,0,0,0.5230,0,1,0
1,0.28,1,0,0,0.3130,0,0,1
1,0.54,0,0,1,0.6260,1,0,0

Test data.

# people_test.txt
#
-1,0.51,1,0,0,0.6120,0,1,0
-1,0.32,0,1,0,0.4610,0,1,0
1,0.55,1,0,0,0.6270,1,0,0
1,0.25,0,0,1,0.2620,0,0,1
1,0.33,0,0,1,0.3730,0,0,1
-1,0.29,0,1,0,0.4620,1,0,0
1,0.65,1,0,0,0.7270,1,0,0
-1,0.43,0,1,0,0.5140,0,1,0
-1,0.54,0,1,0,0.6480,0,0,1
1,0.61,0,1,0,0.7270,1,0,0
1,0.52,0,1,0,0.6360,1,0,0
1,0.3,0,1,0,0.3350,0,0,1
1,0.29,1,0,0,0.3140,0,0,1
-1,0.47,0,0,1,0.5940,0,1,0
1,0.39,0,1,0,0.4780,0,1,0
1,0.47,0,0,1,0.5200,0,1,0
-1,0.49,1,0,0,0.5860,0,1,0
-1,0.63,0,0,1,0.6740,0,0,1
-1,0.3,1,0,0,0.3920,1,0,0
-1,0.61,0,0,1,0.6960,0,0,1
-1,0.47,0,0,1,0.5870,0,1,0
1,0.3,0,0,1,0.3450,0,0,1
-1,0.51,0,0,1,0.5800,0,1,0
-1,0.24,1,0,0,0.3880,0,1,0
-1,0.49,1,0,0,0.6450,0,1,0
1,0.66,0,0,1,0.7450,1,0,0
-1,0.65,1,0,0,0.7690,1,0,0
-1,0.46,0,1,0,0.5800,1,0,0
-1,0.45,0,0,1,0.5180,0,1,0
-1,0.47,1,0,0,0.6360,1,0,0
-1,0.29,1,0,0,0.4480,1,0,0
-1,0.57,0,0,1,0.6930,0,0,1
-1,0.2,1,0,0,0.2870,0,0,1
-1,0.35,1,0,0,0.4340,0,1,0
-1,0.61,0,0,1,0.6700,0,0,1
-1,0.31,0,0,1,0.3730,0,1,0
1,0.18,1,0,0,0.2080,0,0,1
1,0.26,0,0,1,0.2920,0,0,1
-1,0.28,1,0,0,0.3640,0,0,1
-1,0.59,0,0,1,0.6940,0,0,1

An Example of Sensitivity Analysis for a PyTorch Model

In sensitivity analysis, you examine the effects of changing input values to a machine learning prediction model. The classic example is looking at a model that predicts the credit worthiness of a loan applicant based on things like income, debt, age, and so on. If the model predicts 0 = decline loan, you might want to examine the effect of the debt predictor variable to see at what point the prediction changes to 1 = approve loan. You’d typically want to know the smallest amount of reduction in debt that would generate an approve-loan result.

Before I go any further, let me point out that sensitivity analysis has a serious drawback. I explain it below.

Sensitivity analysis is closely related to, and in fact is pretty much the same as, what-if analysis. For example, “What if the income value is increased by $1,000?” And sensitivity analysis is a form of model interpretability — understanding how a model works.

I implemented a demo using PyTorch. The demo model predicts a person’s political leaning (conservative, moderate, liberal) based on sex (M, F), age, state (Michigan, Nebraska, Oklahoma), and annual income. After training, I set up an input of Male, 30 years old, Oklahoma, $50,000. The predicted political leaning in pseudo-probabilities is [[0.6905 0.3049 0.0047]], which is class 0 (conservative).

Then I varied the age input value from 0.00 to 0.75 and examined the results:

Age       Pseudo-Probabilities    Predicted Politics
--------------------------------------------------------
0.00  |  [[0.9956 0.0044 0.    ]]  |  0
0.05  |  [[0.9928 0.0072 0.    ]]  |  0
0.10  |  [[0.9868 0.0132 0.    ]]  |  0
0.15  |  [[0.9728 0.0272 0.0001]]  |  0
0.20  |  [[0.9381 0.0616 0.0002]]  |  0
0.25  |  [[0.8552 0.1438 0.001 ]]  |  0
0.30  |  [[0.6905 0.3049 0.0047]]  |  0
0.35  |  [[0.4666 0.5145 0.0189]]  |  1
0.40  |  [[0.2732 0.6661 0.0607]]  |  1
0.45  |  [[0.1517 0.6913 0.157 ]]  |  1
0.50  |  [[0.0831 0.5905 0.3263]]  |  1
0.55  |  [[0.0445 0.4168 0.5387]]  |  2
0.60  |  [[0.0234 0.2527 0.7239]]  |  2
0.65  |  [[0.0126 0.1427 0.8447]]  |  2
0.70  |  [[0.0072 0.0813 0.9115]]  |  2
0.75  |  [[0.0045 0.0491 0.9464]]  |  2

The sensitivity analysis shows that age = 30 is somewhat of a critical value: by age 35 the predicted political leaning has switched from conservative to moderate, and it stays moderate through age 50 before switching to liberal at age 55. The analysis shows there’s a roughly linear relationship between age and political leaning, but that relationship is only for Male, Oklahoma, $50,000; it may not hold for other combinations of input values.

Note: If you made a graph of this data with values of age on the x-axis, the result is called an Individual Conditional Expectation (ICE) Plot. You are examining the effect of changing age for a specific individual data item — Male, 30 years old, Oklahoma, $50,000.

Based on these results, a logical next step would be to examine age values between 0.30 (30 years old) and 0.39 at a more granular level. For example:

0.30  |  [[0.6905 0.3049 0.0047]]  |  0
0.31  |  [[0.6479 0.3458 0.0063]]  |  0
0.32  |  [[0.6034 0.3882 0.0084]]  |  0
0.33  |  [[0.5578 0.4312 0.0111]]  |  0
0.34  |  [[0.5119 0.4736 0.0145]]  |  0
0.35  |  [[0.4666 0.5145 0.0189]]  |  1
0.36  |  [[0.4229 0.5529 0.0243]]  |  1
0.37  |  [[0.3812 0.5878 0.0309]]  |  1
0.38  |  [[0.3422 0.6187 0.0391]]  |  1
0.39  |  [[0.3062 0.6449 0.0489]]  |  1

The data suggests that ages 34-35 are some kind of boundary.
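
The boundary can also be located mechanically. The sketch below is not part of the demo: find_boundaries is a hypothetical helper, and the rows are copied by hand from the fine-grained table above. It scans consecutive rows and reports where the predicted class (the index of the largest pseudo-probability) changes.

```python
# Rows copied from the fine-grained table: (age, pseudo-probabilities).
rows = [
    (0.30, [0.6905, 0.3049, 0.0047]),
    (0.31, [0.6479, 0.3458, 0.0063]),
    (0.32, [0.6034, 0.3882, 0.0084]),
    (0.33, [0.5578, 0.4312, 0.0111]),
    (0.34, [0.5119, 0.4736, 0.0145]),
    (0.35, [0.4666, 0.5145, 0.0189]),
    (0.36, [0.4229, 0.5529, 0.0243]),
    (0.37, [0.3812, 0.5878, 0.0309]),
    (0.38, [0.3422, 0.6187, 0.0391]),
    (0.39, [0.3062, 0.6449, 0.0489]),
]

def find_boundaries(rows):
    # Return (age_before, age_after, class_before, class_after) for each
    # pair of consecutive rows whose predicted class differs.
    changes = []
    prev_age, prev_cls = None, None
    for age, probs in rows:
        cls = probs.index(max(probs))  # argmax over the three classes
        if prev_cls is not None and cls != prev_cls:
            changes.append((prev_age, age, prev_cls, cls))
        prev_age, prev_cls = age, cls
    return changes

print(find_boundaries(rows))  # [(0.34, 0.35, 0, 1)]
```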

There aren’t any really good general purpose sensitivity analysis tools because how you perform an analysis is highly problem-dependent.

A serious drawback of sensitivity analysis of one input variable is that it doesn’t take into account interaction effects with other input variables. For example, in the analysis above, the combined effect of age and sex could be completely different from the effect of age by itself. Exploring all combinations of input variables isn’t practical in most cases. This drawback is so serious that I rarely use sensitivity analysis because the results could be very misleading.

A technique called Shapley Value examines the effect of changing all possible combinations of input variables. In other words, Shapley Value analysis is combinatorial sensitivity analysis. See: jamesmccaffrey.wordpress.com/2020/11/09/example-of-the-shapley-value-for-machine-learning-interpretability/.

Note: A technique that’s closely related to sensitivity analysis is called making a Partial Dependence Plot (PDP). PDP is used most often for regression models where the prediction is a numeric value, such as predicting a person’s income. In a PDP you pick a predictor of interest, say age, then find all the possible age values in the data — perhaps (18, 19, 20, 23, 24, 29, 30, . . 56). For each possible age value you do a simulation, say 10,000 trials, where you pick random (but legal) values for the other predictors, and compute the average prediction value. PDPs don’t take feature interaction into account so PDPs are not all that great.
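
The PDP simulation described in the note can be sketched as follows. This is only an illustration under stated assumptions: partial_dependence, sample_others and toy_predict are hypothetical names, and toy_predict is a made-up stand-in for a trained regression model, not the demo network.

```python
import random

def partial_dependence(predict, values, sample_others, n_trials=10_000, seed=0):
    # For each value of the predictor of interest, average the model's
    # prediction over random (but legal) values of the other predictors.
    rng = random.Random(seed)
    result = {}
    for v in values:
        total = 0.0
        for _ in range(n_trials):
            total += predict(v, sample_others(rng))
        result[v] = total / n_trials
    return result

def sample_others(rng):
    # Assumed legal values for the other predictors (sex, state).
    sex = rng.choice([-1, 1])
    state = rng.choice(["michigan", "nebraska", "oklahoma"])
    return sex, state

def toy_predict(age, others):
    # Made-up income model, used only so the sketch runs.
    sex, state = others
    return 0.10 + 1.0 * age + (0.05 if sex == -1 else 0.0)

pdp = partial_dependence(toy_predict, [0.18, 0.30, 0.56],
                         sample_others, n_trials=2000)
for age, avg in pdp.items():
    print("age = %0.2f  |  average predicted income = %0.4f" % (age, avg))
```

Plotting the age values against the averages gives the PDP. Because the other predictors are averaged away, any interaction between them and age is hidden, which is the weakness mentioned in the note.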



The word “sensitive” when applied to people means someone who has “a delicate appreciation of others’ feelings”. Here are two science fiction movies where the alien is definitely not sensitive. Left: The Martian mastermind from “Invaders from Mars” (1953), one of my favorite films of the 1950s. Right: The alien from “Alien” (1979) — the movie scared me a lot when I first saw it.


Demo code. Wherever “lt”, “gt”, “lte”, or “gte” appears, replace it with the corresponding comparison operator symbol. The data can be found at jamesmccaffrey.wordpress.com/2022/09/01/multi-class-classification-using-pytorch-1-12-1-on-windows-10-11/.

# people_sensitivity.py
# predict politics type from sex, age, state, income
# PyTorch 1.12.1-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11 

import numpy as np
import torch as T
device = T.device('cpu')  # apply to Tensor or Module

# -----------------------------------------------------------

class PeopleDataset(T.utils.data.Dataset):
  # sex  age    state    income   politics
  # -1   0.27   0  1  0   0.7610   2
  # +1   0.19   0  0  1   0.6550   0
  # sex: -1 = male, +1 = female
  # state: michigan, nebraska, oklahoma
  # politics: conservative, moderate, liberal

  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(0,7),
      delimiter="\t", comments="#", dtype=np.float32)
    tmp_x = all_xy[:,0:6]   # cols [0,6) = [0,5]
    tmp_y = all_xy[:,6]     # 1-D

    self.x_data = T.tensor(tmp_x, 
      dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y,
      dtype=T.int64).to(device)  # 1-D

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    trgts = self.y_data[idx] 
    return preds, trgts  # as a Tuple

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(6, 10)  # 6-(10-10)-3
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 3)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss() 
    return z

# -----------------------------------------------------------

def accuracy(model, ds):
  # assumes model.eval()
  # item-by-item version
  n_correct = 0; n_wrong = 0
  for i in range(len(ds)):
    X = ds[i][0].reshape(1,-1)  # make it a batch
    Y = ds[i][1].reshape(1)  # 0 1 or 2, 1D
    with T.no_grad():
      oupt = model(X)  # log-probabilities (log_softmax output)

    big_idx = T.argmax(oupt)  # 0 or 1 or 2
    if big_idx == Y:
      n_correct += 1
    else:
      n_wrong += 1

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin People predict politics sensitivity ")
  T.manual_seed(1)
  np.random.seed(1)
  
  # 1. create Dataset and DataLoader objects
  print("\nCreating People Datasets ")

  train_file = ".\\Data\\people_train.txt"
  train_ds = PeopleDataset(train_file)  # 200 rows

  test_file = ".\\Data\\people_test.txt"
  test_ds = PeopleDataset(test_file)    # 40 rows

  bat_size = 10
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

# -----------------------------------------------------------

  # 2. create network
  print("\nCreating 6-(10-10)-3 neural network ")
  net = Net().to(device)
  net.train()

# -----------------------------------------------------------

  # 3. train model
  max_epochs = 1000
  ep_log_interval = 200
  lrn_rate = 0.01

  loss_func = T.nn.NLLLoss()  # assumes log_softmax()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_size)
  print("loss = " + str(loss_func))
  print("optimizer = SGD")
  print("max_epochs = %3d " % max_epochs)
  print("lrn_rate = %0.3f " % lrn_rate)

  print("\nStarting training")
  for epoch in range(0, max_epochs):
    # T.manual_seed(epoch+1)  # checkpoint reproducibility
    epoch_loss = 0  # for one full epoch

    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # inputs
      Y = batch[1]  # correct class/label/politics

      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()
      optimizer.step()

    if epoch % ep_log_interval == 0:
      print("epoch = %5d  |  loss = %10.4f" % \
        (epoch, epoch_loss))

  print("Training done ")

# -----------------------------------------------------------

  # 4. evaluate model accuracy
  print("\nComputing model accuracy")
  net.eval()
  acc_train = accuracy(net, train_ds)  # item-by-item
  print("Accuracy on training data = %0.4f" % acc_train)
  acc_test = accuracy(net, test_ds) 
  print("Accuracy on test data = %0.4f" % acc_test)

# -----------------------------------------------------------

  # 5. make a prediction
  print("\nPredicting politics for M  30  oklahoma  $50,000: ")
  X = np.array([[-1, 0.30,  0,0,1,  0.5000]], dtype=np.float32)
  X = T.tensor(X, dtype=T.float32).to(device) 

  with T.no_grad():
    log_probs = net(X)  # log_softmax output; does not sum to 1.0
  probs = T.exp(log_probs)  # sum to 1.0
  probs = probs.numpy()  # numpy vector prints better
  pred_class = np.argmax(probs)
  np.set_printoptions(precision=4, suppress=True)
  print(probs, end=""); print("  |  " + str(pred_class))

# -----------------------------------------------------------

  # 6. sensitivity analysis
  print("\nExamining effect of age on politics type \n")
  X = np.array([[-1, 0.30,  0,0,1,  0.5000]],
    dtype=np.float32)
  X = T.tensor(X, dtype=T.float32).to(device) 

  age = 0.0
  while age < 0.80:
    X[0][1] = age
    with T.no_grad():
      probs = T.exp(net(X)).numpy()
    pred_class = np.argmax(probs)
    print("%4.2f  |  " % age, end ="")
    print(probs, end ="")
    print("  |  %d " % pred_class)

    age += 0.05
    
  print("\nEnd People sensitivity demo")

if __name__ == "__main__":
  main()

Machine Learning and Economics Alpha Generation Platform

I get cold-call email messages from job recruiters every couple of weeks or so. I usually try to be polite and answer that I’m satisfied with my current position working in the Research department of a very large tech company.

I got a message from a recruiter recently that asked if I was interested in a job related to “hedge fund alpha generation research”. I didn’t know exactly what that meant so I did some googling.

Bottom line: the job position is to find algorithm(s) that make money.

First, I looked up the meaning of “alpha”. According to Wikipedia, “alpha is a measure of the active return on an investment, the performance of that investment compared with a suitable market index.”

OK, but what does that mean?

Eventually I discovered a more concrete definition:

Alpha is a term used in investing to describe an investment strategy’s ability to beat the market, or its “edge.” Alpha is thus also often referred to as “excess return” or “abnormal rate of return,” which refers to the idea that markets are efficient, and so there is no way to systematically earn returns that exceed the broad market as a whole.

With that definition, suppose an investment strategy returns 18% over a year (Ri) and the S&P 500 returns 15% (Rm) for the same period. Then:

alpha = (Ri - Rm) / Rm
      = (18.0 - 15.0) / 15.0
      = 3.0 / 15.0
      = 0.20

which means the strategy is 20% better than the baseline S&P 500. One drawback of this plain alpha measure is that it doesn’t take into account how risky/volatile your investment strategy is.
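
The plain alpha arithmetic is easy to check in code. This is a minimal sketch; the function name plain_alpha is my own label for the formula above.

```python
def plain_alpha(r_i, r_m):
    # Relative excess return of a strategy (r_i) over a market index (r_m).
    # Both returns are in percent over the same period.
    return (r_i - r_m) / r_m

print("%0.2f" % plain_alpha(18.0, 15.0))  # 0.20
```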

But there are two different alpha definitions: plain alpha and Jensen’s alpha. Jensen’s alpha is so much more common that “alpha” usually means Jensen’s alpha.

I didn’t fully grasp the idea of Jensen’s alpha until I found an example:

alpha = Ri - [ Rf + b * (Rm - Rf) ]

where Ri is the actual return using some strategy, Rm is the actual return of a market index such as the S&P 500, Rf is the risk-free return rate such as interest from a bank account, and b is the beta value (measure of volatility) of the strategy relative to Rm.

Suppose you could get a 3% return by putting money in a bank account (Rf), the S&P 500 index returned 12% (Rm), your strategy returned 18% (Ri), and beta for your strategy is 1.5. Then:

alpha = Ri - [ Rf + b * (Rm - Rf) ]  
      = 18.0 - [ 3.0 + 1.5 * (12.0 - 3.0) ]
      = 18.0 - (3.0 + 13.5)
      = 18.0 - 16.5
      = 1.5

In words, alpha is a measure of how much better (or worse if alpha is negative) a strategy does compared to a risk-free investment (like a bank account) and a benchmark (like the S&P 500), weighted by how variable the strategy is (beta). Somewhat unfortunately, there are several ways to compute beta.
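
Here is the same Jensen’s alpha calculation as a small function. It’s a sketch that takes beta as a given input, because the way beta is computed isn’t covered here; the name jensens_alpha is my own.

```python
def jensens_alpha(r_i, r_m, r_f, beta):
    # r_i: strategy return, r_m: market index return,
    # r_f: risk-free return, all in percent; beta: volatility vs. the index.
    expected = r_f + beta * (r_m - r_f)  # return expected for that much risk
    return r_i - expected

print("%0.1f" % jensens_alpha(18.0, 12.0, 3.0, 1.5))  # 1.5
```

Note that with beta = 1.0 the risk-free rate cancels out and alpha reduces to Ri - Rm, which would be 6.0 percentage points for these numbers.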

So, an alpha generation platform is just a software system that implements an algorithm for financial investments that does better than an index fund or a risk-free investment. Presumably, such a software system uses machine learning.




NFL 2022 Week 5 Predictions – Zoltar Has Interesting Unhuman Advice

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #5 of the 2022 season. These predictions are fuzzy, in the sense that it usually takes Zoltar a few weeks to hit his stride.

Zoltar:       colts  by    0  dog =     broncos    Vegas:     broncos  by  3.5
Zoltar:     packers  by    7  dog =      giants    Vegas:     packers  by    8
Zoltar:       bills  by    6  dog =    steelers    Vegas:       bills  by   14
Zoltar:      browns  by    2  dog =    chargers    Vegas:    chargers  by    3
Zoltar:     jaguars  by    2  dog =      texans    Vegas:     jaguars  by    7
Zoltar:     vikings  by    6  dog =       bears    Vegas:     vikings  by    7
Zoltar:      saints  by    6  dog =    seahawks    Vegas:      saints  by  6.5
Zoltar:    patriots  by    8  dog =       lions    Vegas:    patriots  by    3
Zoltar:    dolphins  by    0  dog =        jets    Vegas:    dolphins  by  3.5
Zoltar:  buccaneers  by    8  dog =     falcons    Vegas:  buccaneers  by  8.5
Zoltar:      titans  by    0  dog =  commanders    Vegas:      titans  by    3
Zoltar: fortyniners  by    0  dog =    panthers    Vegas: fortyniners  by    7
Zoltar:   cardinals  by    6  dog =      eagles    Vegas:      eagles  by    5
Zoltar:        rams  by    4  dog =     cowboys    Vegas:        rams  by    4
Zoltar:     bengals  by    0  dog =      ravens    Vegas:      ravens  by    3
Zoltar:      chiefs  by    6  dog =     raiders    Vegas:      chiefs  by    7

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I use 3.0 points difference but for the first few weeks of the season I am a bit more conservative and use 4.0 points difference as the advice threshold criterion.
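
The threshold rule can be sketched in a few lines. This is not Zoltar’s actual code: the function advice and its signed-margin convention are hypothetical, and treating a difference of exactly 4.0 points as enough to trigger advice is my assumption.

```python
def advice(team_a, team_b, zoltar_margin_a, vegas_margin_a, threshold=4.0):
    # Margins are the points by which team_a is expected to win;
    # a negative margin means team_b is favored.
    diff = zoltar_margin_a - vegas_margin_a
    if diff >= threshold:
        return team_a   # Zoltar rates team_a higher than Vegas does
    if diff <= -threshold:
        return team_b   # Zoltar rates team_b higher than Vegas does
    return None         # lines are too close; no advice

# Margins taken from the prediction table above.
print(advice("bills", "steelers", 6, 14))    # steelers
print(advice("browns", "chargers", 2, -3))   # browns
print(advice("patriots", "lions", 8, 3))     # patriots
print(advice("colts", "broncos", 0, -3.5))   # None
```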

At the beginning of the season, because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is very strongly biased towards Vegas underdogs. I probably need to fix this. For week #5 Zoltar has some very strange advice:

1. Zoltar likes Vegas underdog Steelers against the Bills.
2. Zoltar likes Vegas underdog Browns against the Chargers.
3. Zoltar likes Vegas underdog Texans against the Jaguars.
4. Zoltar likes Vegas favorite Patriots over the Lions.
5. Zoltar likes Vegas underdog Panthers against the 49ers.
6. Zoltar likes Vegas underdog Cardinals against the Eagles.

For example, a bet on the underdog Browns against the Chargers will pay off if the Browns win by any score, or if the favored Chargers win but by less than 3.0 points (in other words, by 2 points or less). If the favored Chargers win by exactly 3 points, the wager is a push.

As a human, I wouldn’t make most of these wagers. Vegas favors the Bills by an enormous 14.0 points and Zoltar isn’t programmed to deal with that big a spread. The Browns, Texans and Panthers have looked terrible to the human eye so far this season. The Eagles have looked great so far. But the difference between Zoltar and the human eye is what makes this all interesting to me.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
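The 53% figure comes from simple expected-value arithmetic. A quick sketch (the function names are just for illustration): if p is the probability of winning, the bet breaks even when p * 100 = (1 - p) * 110, which gives p = 110 / 210.

```python
def break_even_accuracy(risk, win):
  # solve p * win - (1 - p) * risk = 0 for p
  return risk / (risk + win)

def expected_profit(p, risk, win):
  # average profit per bet at prediction accuracy p
  return p * win - (1.0 - p) * risk

print("break-even accuracy = %0.4f" % break_even_accuracy(110, 100))  # 0.5238
print("profit per bet at 53%% = $%0.2f" % expected_profit(0.53, 110, 100))  # $1.30
print("profit per bet at 60%% = $%0.2f" % expected_profit(0.60, 110, 100))  # $16.00
```

At 53% accuracy the expected profit is only about $1.30 per $110 risked, which gives a sense of how thin the theoretical margin is.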

In week #4, against the Vegas point spread, Zoltar went 3-2 (using 4.0 points as the advice threshold). Zoltar had six hypothetical pieces of advice but in one game, Vikings vs. Saints, the favored Vikings won by the exact point spread of 3.0 points.

For the season, against the spread, Zoltar is 14-8 (63.6% accuracy).

Just for fun, I track how well Zoltar does when trying to predict only which team will win a game. This isn’t useful except for parlay betting. In week #4, just predicting the winning team, Zoltar went 11-5, which is OK but not great. Vegas was also 11-5 at just predicting the winning team.

Zoltar sometimes predicts a 0-point margin of victory. There are five such games in week #5. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.



My system is named after the Zoltar fortune teller machine you can find in arcades. Zoltar uses a crystal ball. There are lots of movies that feature crystal ball fortune tellers. Left: “The Wizard of Oz” (1939). Center: “Labyrinth” (1986). Right: “Harry Potter and the Prisoner of Azkaban” (2004).


Posted in Zoltar

How to Encode Ordinal Predictor Values for a Neural Network

If you have categorical (also called nominal) predictor data, you can encode it using one-hot encoding. For example, a predictor variable of color with possible values (red, blue, green) can be encoded as red = 1 0 0, blue = 0 1 0, green = 0 0 1.

But what about ordinal data such as a movie rating with possible values (terrible, poor, average, good, excellent)? If you use one-hot encoding, you throw away the ordering information, such as the fact that good is closer to excellent than poor is.

When I have an ordinal predictor variable I encode by using values between 0.1 and 1.0. For example, the movie rating values could be terrible = 0.3, poor = 0.4, average = 0.5, good = 0.6, excellent = 0.7. There’s no deep theory here — the idea just seems to make sense. And it has worked well for me in practice.

I ran across an interesting dataset with ordinal predictor variables. The Diamonds dataset has 53,940 data items. Each line of data has ten variables: id, carat, cut, color, clarity, length, width, depth, table-width, price. The cut, color, and clarity variables are ordinal. The data can be found at kaggle.com/datasets/shivam2503/diamonds.

I wrote a helper program to prepare data to predict price from carat, cut, color, and clarity. The raw carat values range from 0.20 to 5.01. I normalized by dividing by 10.0.

The raw cut values are (Fair, Good, Very Good, Premium, Ideal). I encoded them as (0.3, 0.4, 0.5, 0.6, 0.7) respectively.

The raw color values (from worst to best) are (J, I, H, G, F, E, D). I encoded them as (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8).

The raw clarity values (from worst to best) are (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF). I encoded them as (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9).

The raw price values are between $326 and $18,823 so I normalized by dividing by 20,000.
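The helper program isn't shown, but the encoding scheme described above can be sketched with three lookup tables. The names `CUT`, `COLOR`, `CLARITY` and `encode_diamond` are hypothetical, and the sample price of $10,000 is made up.

```python
# ordinal encodings, worst to best, as described in the text
CUT = {"Fair": 0.3, "Good": 0.4, "Very Good": 0.5,
  "Premium": 0.6, "Ideal": 0.7}
COLOR = {"J": 0.2, "I": 0.3, "H": 0.4, "G": 0.5,
  "F": 0.6, "E": 0.7, "D": 0.8}
CLARITY = {"I1": 0.2, "SI2": 0.3, "SI1": 0.4, "VS2": 0.5,
  "VS1": 0.6, "VVS2": 0.7, "VVS1": 0.8, "IF": 0.9}

def encode_diamond(carat, cut, color, clarity, price):
  # returns [carat, cut, color, clarity, price], all in 0.0 to 1.0
  return [carat / 10.0, CUT[cut], COLOR[color],
    CLARITY[clarity], price / 20000.0]

# made-up item: 1.5 carats, Very Good cut, G color, VS2 clarity
print(encode_diamond(1.5, "Very Good", "G", "VS2", 10000))
```

A dictionary lookup raises a KeyError on an unexpected raw value, which is a handy sanity check when preparing 53,940 lines of data.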

I used all 53,940 data items for training (I didn’t hold out any data as a test dataset).

I implemented a 4-(10-10-10)-1 regression neural network using PyTorch. After training, the model predicted diamond price with 71.23% accuracy, where a correct prediction is one within 15% of the true price.
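The "within 15% of the true price" accuracy metric is easy to express with NumPy. This is a sketch with made-up normalized prices, not the demo's item-by-item function; the name `accuracy_within` is hypothetical.

```python
import numpy as np

def accuracy_within(pred, actual, pct_close):
  # fraction of predictions within pct_close of the true value
  pred = np.asarray(pred, dtype=np.float64)
  actual = np.asarray(actual, dtype=np.float64)
  n_correct = np.sum(np.abs(pred - actual) < np.abs(pct_close * actual))
  return n_correct / len(actual)

# made-up normalized prices (price / 20,000)
actual = [0.10, 0.20, 0.50, 0.80]
pred   = [0.11, 0.26, 0.45, 0.60]
print(accuracy_within(pred, actual, 0.15))  # 2 of 4 are close enough
```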



I’ve never completely understood the whole idea of diamonds for jewelry. To me diamonds are not particularly beautiful or desirable. But I do find great beauty in old electro-mechanical machines. Left: “King of Diamonds” (Gottlieb, 1967) pinball machine. Right: “Diamond Lill” (Gottlieb, 1954) pinball machine.


Demo code. Replace “lt”, “gt”, “lte”, “gte” with Boolean operator symbols.

# diamonds_price.py
# predict diamond price from carat, cut, color, clarity
# PyTorch 1.12.1-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10 

import numpy as np
import torch as T
device = T.device('cpu')  # apply to Tensor or Module

# -----------------------------------------------------------

class DiamondsDataset(T.utils.data.Dataset):
  def __init__(self, src_file):
    # carat, cut, color, clarity, price
    # all numeric in 0.0 to 1.0
    tmp_x = np.loadtxt(src_file, usecols=[0,1,2,3],
      delimiter=",", comments="#", dtype=np.float32)
    tmp_y = np.loadtxt(src_file, usecols=4, delimiter=",",
      comments="#", dtype=np.float32)
    tmp_y = tmp_y.reshape(-1,1)  # 2D required for regression

    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.float32).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    price = self.y_data[idx] 
    return (preds, price)  # as a tuple

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 10)  # 4-(10-10-10)-1
    self.hid2 = T.nn.Linear(10, 10)
    self.hid3 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.hid3.weight)
    T.nn.init.zeros_(self.hid3.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.tanh(self.hid3(z))
    z = self.oupt(z)  # regression: no activation
    return z

# -----------------------------------------------------------

def accuracy(model, ds, pct_close):
  # assumes model.eval()
  # correct within pct of true diamond price
  n_correct = 0; n_wrong = 0

  for i in range(len(ds)):
    X = ds[i][0]   # 1-D predictors, shape [4]
    Y = ds[i][1]   # 1-D target, shape [1]
    with T.no_grad():
      oupt = model(X)       # computed price, shape [1]

    if T.abs(oupt - Y) < T.abs(pct_close * Y):
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def train(model, ds, bs, lr, me, le):
  # dataset, bat_size, lrn_rate, max_epochs, log interval
  train_ldr = T.utils.data.DataLoader(ds,
    batch_size=bs, shuffle=True)
  loss_func = T.nn.MSELoss()
  optimizer = T.optim.Adam(model.parameters(), lr=lr)
  # optimizer = T.optim.SGD(model.parameters(), lr=lr)

  for epoch in range(0, me):
    epoch_loss = 0  # for one full epoch

    for (b_idx, batch) in enumerate(train_ldr):
      X = batch[0]
      y = batch[1]
      optimizer.zero_grad()
      oupt = model(X)
      loss_val = loss_func(oupt, y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()  # compute gradients
      optimizer.step()     # update weights

    if epoch % le == 0:
      print("epoch = %4d  |  loss = %0.4f" % \
        (epoch, epoch_loss)) 

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin diamonds predict price ")
  T.manual_seed(1)
  np.random.seed(1)
  
  # 1. create DataLoader objects
  print("\nCreating diamonds Dataset object ")
  train_file = ".\\Data\\diamonds_all.txt"
  train_ds = DiamondsDataset(train_file)  # 53,940 rows

  bat_size = 100
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  # 2. create network
  print("\nCreating 4-(10-10-10)-1 neural network ")
  net = Net().to(device)

# -----------------------------------------------------------

  # 3. train model
  print("\nbat_size = 100 ")
  print("loss = MSELoss() ")
  print("optimizer = Adam ")
  print("lrn_rate = 0.005 ")

  print("\nStarting training")
  net.train()
  train(net, train_ds, bs=100, lr=0.005, me=100, le=10)
  print("Done ")

# -----------------------------------------------------------

  # 4. evaluate model accuracy
  print("\nComputing model accuracy (within 0.15 of true) ")
  net = net.eval()
  acc_train = accuracy(net, train_ds, 0.15)  # item-by-item
  print("Accuracy on train data = %0.4f" % acc_train)

  # acc_test = accuracy_x(net, test_ds, 0.10)  # all-at-once
  # print("Accuracy on test data = %0.4f" % acc_test)

# -----------------------------------------------------------

  # 5. make a prediction
  print("\nPredicting price for carat = 1.5, cut = VG, \
color = G, clarity = VS2: ")
  x = np.array([[0.15, 0.5, 0.5, 0.5]],
    dtype=np.float32)
  x = T.tensor(x, dtype=T.float32).to(device) 

  with T.no_grad():
    pred_price = net(x)
  pred_price = pred_price.item()  # scalar
  print("$%0.2f" % (pred_price * 20_000))  # un-normalized

# -----------------------------------------------------------

  # 6. save model (state_dict approach)
  # print("\nSaving trained model state")
  # fn = ".\\Models\\diamonds_model.pt"
  # T.save(net.state_dict(), fn)

  # saved_model = Net()
  # saved_model.load_state_dict(T.load(fn))
  # use saved_model to make prediction(s)

  print("\nEnd diamonds predict price demo ")

if __name__ == "__main__":
  main()
Posted in Machine Learning, PyTorch

The Distance Between Two Datasets Using Transformer Encoding

Several months ago I devised an algorithm that computes a value that represents the distance (difference) between two datasets. Computing the distance between two datasets is a remarkably difficult task — I consider it one of the unsolved fundamental problems in computer science.

My dataset distance algorithm for two datasets P and Q works something like this:

create a neural encoder-decoder for P
use encoder to get a frequency distribution for P
use encoder to get a frequency distribution for Q
return an f-divergence between distributions
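The last three steps can be sketched with NumPy once the encoder has produced latent values. This is an illustration under several assumptions of mine, not the actual algorithm: latent values lie in [0, 1], each latent dimension is binned into 10 equal-width buckets, Kullback-Leibler divergence is used as the f-divergence, and the per-dimension results are averaged. Random arrays stand in for real encoder output.

```python
import numpy as np

def freq_distribution(latents, n_bins=10):
  # latents: [n_items, latent_dim], values assumed in [0.0, 1.0]
  # returns [latent_dim, n_bins]; each row sums to 1
  n_items, dim = latents.shape
  freqs = np.zeros((dim, n_bins), dtype=np.float64)
  for j in range(dim):
    counts, _ = np.histogram(latents[:, j], bins=n_bins,
      range=(0.0, 1.0))
    freqs[j] = counts / n_items
  return freqs

def kl_divergence(p, q, eps=1.0e-5):
  # KL(p || q), smoothed so empty bins don't produce log(0)
  p = (p + eps) / np.sum(p + eps)
  q = (q + eps) / np.sum(q + eps)
  return np.sum(p * np.log(p / q))

def dataset_distance(latents_p, latents_q):
  fp = freq_distribution(latents_p)
  fq = freq_distribution(latents_q)
  # average the per-dimension divergences
  return np.mean([kl_divergence(fp[j], fq[j])
    for j in range(len(fp))])

rng = np.random.default_rng(0)
enc_p = rng.beta(2.0, 5.0, size=(100, 4))     # stand-in for encoded P
enc_q = rng.uniform(0.0, 1.0, size=(100, 4))  # stand-in for encoded Q
print(dataset_distance(enc_p, enc_p))  # identical data gives 0.0
print(dataset_distance(enc_p, enc_q))  # different data gives a positive value
```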

I used standard neural technology to create the neural encoder-decoder module — basic Linear (PyTorch) / Dense (Keras) layers. See jamesmccaffrey.wordpress.com/2021/09/27/computing-the-distance-between-two-datasets-using-autoencoded-wasserstein-distance/.

But I’ve been working with Transformer Architecture systems and wondered if I could modify my dataset distance algorithm to use a TransformerEncoder instead of standard Linear layers. After a few hours of work, I got such an algorithm working.

For my experiment, I used the UCI Digits dataset. Each data item is an 8 by 8 image of a handwritten digit from ‘0’ to ‘9’. Each of the 64 pixel values is between 0 (white) and 16 (black). I used a 100-item subset of the entire dataset to keep things a bit simpler.

The 100-item dataset was the reference P dataset. I created 10 Q datasets that ranged from 10% randomized to 100% randomized. The idea is that dist(P,Q) should increase as the amount of randomization in Q increases. The graph above shows that’s exactly what happened.

The key network definition is:

class AutoencoderTransformer(T.nn.Module):  # 65-xx-4-32-65
  def __init__(self):
    # 65 numeric inputs: no exact word embedding equivalent
    # pseudo embed_dim = 2
    # seq_len = 65
    super(AutoencoderTransformer, self).__init__()  # classic

    self.fc1 = T.nn.Linear(65, 65*2)  # pseudo-embedding
    self.fc2 = T.nn.Linear(65*2, 4)

    self.pos_enc = \
      PositionalEncoding(2, dropout=0.00)  # positional

    self.enc_layer = T.nn.TransformerEncoderLayer(d_model=2,
      nhead=2, dim_feedforward=100, dropout=0.0,
      batch_first=True)  # d_model divisible by nhead

    self.trans_enc = T.nn.TransformerEncoder(self.enc_layer,
      num_layers=6)

    self.dec1 = T.nn.Linear(4, 32) 
    self.dec2 = T.nn.Linear(32, 65)
    # use default weight initialization
    self.latent_dim = 4

  def encode(self, x):           # x is [bs, 65]
    z = T.tanh(self.fc1(x))      # [bs, 130]
    z = z.reshape(-1, 65, 2)     # [bs, 65, 2]
    z = self.pos_enc(z)          # [bs, 65, 2]
    z = self.trans_enc(z)        # [bs, 65, 2]
    z = z.reshape(-1, 65*2)       # [bs, 130]
    z = T.sigmoid(self.fc2(z))    # [bs, 4]
    return z

  def decode(self, x):
    z = T.tanh(self.dec1(x))     # [bs, 32]
    z = T.sigmoid(self.dec2(z))  # [bs, 65]
    return z    

  def forward(self, x):            # x is [bs,65]
    z = self.encode(x)
    oupt = self.decode(z)
    return oupt

The rest of the code is too long to place in this blog post, and it’s quite tricky. But good fun!
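The network definition uses a PositionalEncoding class that isn't shown. Here is a minimal sketch, assuming the common sinusoidal scheme adapted for batch-first tensors. The `max_len` default and the internal names are assumptions, and the class used in the experiment may differ.

```python
import math
import torch as T

class PositionalEncoding(T.nn.Module):
  # sinusoidal positional encoding; d_model must be even
  def __init__(self, d_model, dropout=0.0, max_len=100):
    super().__init__()
    self.dropout = T.nn.Dropout(p=dropout)
    pos = T.arange(max_len, dtype=T.float32).unsqueeze(1)  # [max_len, 1]
    div = T.exp(T.arange(0, d_model, 2, dtype=T.float32) *
      (-math.log(10000.0) / d_model))                      # [d_model/2]
    pe = T.zeros(1, max_len, d_model)
    pe[0, :, 0::2] = T.sin(pos * div)  # even columns
    pe[0, :, 1::2] = T.cos(pos * div)  # odd columns
    self.register_buffer("pe", pe)     # stored with model, not trained

  def forward(self, x):                # x is [bs, seq_len, d_model]
    x = x + self.pe[:, :x.size(1), :]
    return self.dropout(x)

pos_enc = PositionalEncoding(2, dropout=0.0)
z = T.zeros(3, 65, 2)      # dummy batch: bs = 3, seq_len = 65
print(pos_enc(z).shape)    # same shape as the input
```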



The distance between these parodies and reality is quite small.


Posted in PyTorch, Transformers

Why I Became Disillusioned With Semi-Supervised Learning

In semi-supervised learning, you have data where only a few items have labels but most data items are not labeled. For example, you might have data for 1,000 hospital patients (age, sex, blood pressure, etc.) who have been tested for a rare disease, where 100 patients have the disease and 900 do not. But you also have data for 100,000 patients who haven’t been tested. The goal of semi-supervised learning is to use the 1,000-item labeled data to infer labels for the 100,000-item unlabeled data. Then with your new 101,000 labeled items you can presumably create a much better prediction model than you could with just the 1,000 labeled items.

I spent quite a long time doing a deep dive into various algorithms for semi-supervised learning. I was seduced by various highfalutin’ research papers filled with impressive-looking math equations with Greek letters.

But eventually I came to the realization that all the fancy semi-supervised algorithms are essentially based on the same idea — use the small labeled dataset to create a model of some sort and then use that model to infer labels for the large unlabeled dataset. The bottom line is that you can’t magically create information out of thin air.



Part of the Wikipedia entry on semi-supervised learning. It’s easy to be seduced by all the fancy-looking math equations.


I’m certain that a better way to deal with small labeled datasets is to use what’s called data augmentation. The idea is to take a labeled data item and then mutate it in such a way that you don’t change the class label. The classic example is in image recognition. Suppose you are creating a classifier for 10 different animals: aardvark, bear, cat, dog, elephant, frog, gorilla, hyena, iguana, jaguar. And suppose you have 100 labeled images for each animal. You could take each image and rotate it about the vertical axis — a dog facing left and a dog facing right are still clearly dogs.
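The flip idea takes only a few lines with NumPy. This is a sketch using tiny made-up arrays in place of real images; the function name `augment_with_flips` is hypothetical.

```python
import numpy as np

def augment_with_flips(images, labels):
  # images: [n, height, width]; labels: [n]
  # a left-right mirror image keeps the same class label
  flipped = images[:, :, ::-1]          # reverse the column order
  all_images = np.concatenate([images, flipped], axis=0)
  all_labels = np.concatenate([labels, labels], axis=0)
  return all_images, all_labels

images = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # two made-up 3x4 "images"
labels = np.array([3, 7])                       # made-up class labels
aug_x, aug_y = augment_with_flips(images, labels)
print(aug_x.shape, aug_y)  # (4, 3, 4) [3 7 3 7]
```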

But you have to be careful. For example, changing colors might be OK (brown dog, black dog) but might not be OK (a green lizard is a lizard, but an orange lizard might be a salamander).

I’ve been thinking a bit about data augmentation / mutation for ordinary tabular (not image) data. This is a bit tricky. Suppose you have a dataset of people and are creating a classifier for political leaning (conservative, moderate, liberal) based on sex, age, city, annual income, etc. Mutating a data item income by $1,000 would likely be OK but mutating an age by 5 years could create a bad data item.
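A rough sketch of the income mutation idea, using a made-up person record. The field names, the `mutate_income` function, and the $1,000 limit as a parameter are all illustrative assumptions; only the income is changed, and the class label (politics) is copied unchanged.

```python
import random

def mutate_income(item, rnd, max_delta=1000.0):
  # returns a copy of item with income shifted by up to max_delta dollars
  new_item = dict(item)
  new_item["income"] = item["income"] + \
    rnd.uniform(-max_delta, max_delta)
  return new_item

rnd = random.Random(1)
person = {"sex": "M", "age": 34, "city": "anaheim",
  "income": 62000.0, "politics": "moderate"}  # made-up item
augmented = [mutate_income(person, rnd) for _ in range(3)]
for a in augmented:
  print(a)
```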



In machine learning, augmented datasets are a common idea. In science fiction movies, augmented humans are called cyborgs. Left: In “Cyborg” (1989) scientist Pearl Prophet (actress Deborah Richter) volunteers to become a cyborg to retrieve information to cure a worldwide plague. An OK movie. Center: In “RoboCop” (1987) policeman Alex Murphy (actor Peter Weller) is nearly killed but is revived as a cyborg. A pretty good movie. Right: In “The Machine Girl” (2008) average high school student Ami (actress Minase Yashiro) gets her arm cut off by gangsters but has it replaced by a machine gun. A really weird movie.


Posted in Machine Learning

Simple Lightweight Interpretability for PyTorch Models

Neural networks are powerful but they’re mysterious with regard to why a particular prediction was made. Here’s an example of how to get interpretability information without using any external libraries or deep theory. In the screenshot below, a PyTorch classifier predicts employee job-type (management, support, technical) from sex, age, city (anaheim, boulder, concord), and income. During training, the program records input gradient information. After training, the interpretability information is displayed:

Average gradients for [sex, age, (c1, c2, c3), income]:
raw:        [0.6235 1.7446 0.2060  0.4376 0.2472 1.2193]
normalized: [0.1392 0.3896 0.0460  0.0977 0.0552 0.2723]

The most important input variable for predicting employee job-type is age (0.3896), followed by income (0.2723). The city predictor has relative importance 0.0460 + 0.0977 + 0.0552 = 0.1989.

The idea is to compute gradients for the input variables. Each neural network weight and bias has an associated gradient value. A gradient is just a number like -1.2345 that tells the program how much and in what direction to change its associated weight or bias during training so that the computed outputs are more accurate (less error).

It is possible to compute gradients for input variables too. Each of these gradients is a measure of how much a change in the variable value will change the output values. So a large input gradient value means the associated input variable has a large effect on the output.
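A stripped-down illustration of input gradients, using a small untrained network and random made-up data rather than the employee demo. The network shape mirrors the demo's six inputs and three classes, but everything else here is an assumption for illustration.

```python
import torch as T

T.manual_seed(1)
net = T.nn.Sequential(
  T.nn.Linear(6, 10), T.nn.Tanh(),
  T.nn.Linear(10, 3))            # toy 6-10-3 classifier, untrained
loss_func = T.nn.CrossEntropyLoss()

X = T.rand(10, 6)                # made-up batch: 10 items, 6 inputs
Y = T.randint(0, 3, (10,))       # made-up class labels 0, 1, 2
X.requires_grad = True           # ask for gradients on the inputs

loss_val = loss_func(net(X), Y)
loss_val.backward()              # fills X.grad along with weight grads

print(X.grad.shape)              # torch.Size([10, 6])
print(X.grad.abs().mean(dim=0))  # average magnitude per input variable
```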

You could simply fetch the final set of input gradients after training but all these final values will usually be small. So my simple interpretability technique records and saves input gradients during training every few epochs. Then after training the average of the input gradients gives you an interpretability metric.

As is often the case, the devil is in the details. The key code is:

# training
acc_batch_grads = np.zeros((10,6), dtype=np.float32)
. . .
for (batch_idx, batch) in enumerate(train_ldr):
  X = batch[0]  # inputs
  Y = batch[1]  # correct class/label/job-type

  X.requires_grad = True  # enable
 
  optimizer.zero_grad()
  . . .

if epoch % grad_accum_interval == 0:
  curr_batch_grads = X.grad  # [bs, features]  # [bs, 6]
  acc_batch_grads += np.abs(curr_batch_grads.numpy())

# after training
raw_avg_grads = np.mean(acc_batch_grads, axis=0)
norm_avg_grads = raw_avg_grads / np.sum(raw_avg_grads)

The X.requires_grad statement tells PyTorch to compute gradients for the input variables, which isn’t done by default. Because training is done in batches of input items, the gradients will have shape [bat_size, num_features], which is [10, 6] for the demo. Because I’m only interested in the magnitude of the gradients I use the abs() function.

After training, I compute the average for each of the six predictor variables using the mean() function. To make the average gradients for each predictor variable easier to interpret, I normalize them so that they sum to 1 and can be loosely interpreted as percentages.
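The normalization step can be checked against the raw average gradients shown in the demo output. A small sketch:

```python
import numpy as np

# raw average gradients: sex, age, (c1, c2, c3), income
raw = np.array([0.6235, 1.7446, 0.2060, 0.4376, 0.2472, 1.2193])
norm = raw / np.sum(raw)   # values now sum to 1

np.set_printoptions(precision=4, suppress=True)
print(norm)                # [0.1392 0.3896 0.046  0.0977 0.0552 0.2723]

city = np.sum(norm[2:5])   # combine the three one-hot city columns
print("city importance = %0.4f" % city)  # 0.1989
```

Summing the three city columns gives a single importance value for the city predictor, which is how the 0.1989 figure is obtained.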

Several of my research colleagues work on neural network interpretability. There are quite a few theoretically sophisticated techniques for interpretability but I wanted a technique that 1.) is simple and easy to understand, 2.) is flexible enough to apply to any neural system, 3.) doesn’t require any external dependencies.

Most of the Internet search results for PyTorch interpretability point to a library named Captum. I looked at Captum but I feel it’s very much over-engineered and too complex, especially for use in a research or production environment.

The lightweight interpretability approach demonstrated in this blog post is a white box technique because the model source code must be accessed and modified. There are black box interpretability techniques that only need a trained model. The Shapley Value technique is perhaps the most common of these black box techniques.



In machine learning, the devil is often in the details. In humor, sometimes the details are in the devil. Two cartoons by Gary Larson.


Demo code. The data can be found at https://jamesmccaffrey.wordpress.com/2022/04/29/predicting-employee-job-type-using-pytorch-1-10-on-windows-11/.

# employee_job_interpretability.py
# predict job type from sex, age, city, income
# record and track input gradients for interpretability
# PyTorch 1.12.1-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11 

import numpy as np
import time
import torch as T
device = T.device('cpu')  # apply to Tensor or Module

# -----------------------------------------------------------

class EmployeeDataset(T.utils.data.Dataset):
  # sex  age    city      income  job-type
  # -1   0.27   0  1  0   0.7610   2
  # +1   0.19   0  0  1   0.6550   0
  # sex: -1 = male, +1 = female
  # city: anaheim, boulder, concord
  # job type: mgmt, supp, tech

  def __init__(self, src_file, num_rows=None):
    all_xy = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,7), delimiter="\t", comments="#",
      dtype=np.float32)
    tmp_x = all_xy[0:num_rows,0:6]   # cols [0,6) = [0,5]
    tmp_y = all_xy[0:num_rows,6]  # 1-D
    
    self.x_data = T.tensor(tmp_x, 
      dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y,
      dtype=T.int64).to(device)  # 1-D

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    trgts = self.y_data[idx] 
    return preds, trgts  # as a Tuple

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(6, 10)  # 6-(10-10)-3
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 3)

    T.nn.init.xavier_uniform_(self.hid1.weight)  # explicit
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss() 
    return z

# -----------------------------------------------------------

def accuracy(model, ds):
  n_correct = 0; n_wrong = 0
  for i in range(len(ds)):
    X = ds[i][0].reshape(1,-1)  # make it a batch
    Y = ds[i][1].reshape(1)  # 0 1 or 2
    with T.no_grad():
      oupt = model(X)  # log-probabilities from log_softmax()

    big_idx = T.argmax(oupt)  # 0 or 1 or 2
    if big_idx == Y:
      n_correct += 1
    else:
      n_wrong += 1

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin Employee predict job type with input grads ")
  T.manual_seed(1)
  np.random.seed(1)
  
  # 1. create DataLoader objects
  print("\nCreating Employee Datasets ")

  train_file = ".\\Data\\employee_train.txt"
  train_ds = EmployeeDataset(train_file)  # 200 rows

  test_file = ".\\Data\\employee_test.txt"
  test_ds = EmployeeDataset(test_file)    # 40 rows

  bat_size = 10
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

# -----------------------------------------------------------

  # 2. create network
  print("\nCreating 6-(10-10)-3 neural network ")
  net = Net().to(device)
  net.train()

# -----------------------------------------------------------

  # 3. train model
  max_epochs = 1000
  ep_log_interval = 100
  lrn_rate = 0.01
  grad_accum_interval = 100  # for interpretability

  loss_func = T.nn.NLLLoss()  # assumes log_softmax()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_size)
  print("loss = " + str(loss_func))
  print("optimizer = SGD")
  print("max_epochs = %3d " % max_epochs)
  print("lrn_rate = %0.3f " % lrn_rate)
  print("grad_accum_interval = %d " % grad_accum_interval)

  acc_batch_grads = np.zeros((10,6), dtype=np.float32)
 
  print("\nStarting training")
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch

    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # inputs
      Y = batch[1]  # correct class/label/job-type

      X.requires_grad = True
 
      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()
      optimizer.step()

    if epoch % ep_log_interval == 0:
      print("epoch = %5d  |  loss = %10.4f" % \
        (epoch, epoch_loss))

    if epoch % grad_accum_interval == 0:
      curr_batch_grads = X.grad  # [bs, features]  # [bs, 6]
      acc_batch_grads += np.abs(curr_batch_grads.numpy())
      
  print("Training done ")

# -----------------------------------------------------------
  
  # 4a. show interpretability info
  print("\nAverage gradients sex, age, (c1, c2, c3), income: ")
  np.set_printoptions(precision=4, suppress=True)
  raw_avg_grads = np.mean(acc_batch_grads, axis=0)
  norm_avg_grads = raw_avg_grads / np.sum(raw_avg_grads)
  
  print("raw:        ", end=""); print(raw_avg_grads)
  print("normalized: ", end=""); print(norm_avg_grads)
  
# -----------------------------------------------------------

  # 4b. evaluate model accuracy
  print("\nComputing model accuracy")
  net.eval()
  acc_train = accuracy(net, train_ds)  # item-by-item
  print("Accuracy on training data = %0.4f" % acc_train)
  acc_test = accuracy(net, test_ds) 
  print("Accuracy on test data = %0.4f" % acc_test)

  print("\nEnd Employee predict job with interpretability")

if __name__ == "__main__":
  main()
Posted in PyTorch

NFL 2022 Week 4 Predictions – Zoltar Thinks the Raiders Will Cover Over Broncos

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #4 of the 2022 season. These predictions are fuzzy, in the sense that it usually takes Zoltar about four weeks to hit his stride.

Zoltar:     bengals  by    4  dog =    dolphins    Vegas:     bengals  by  3.5
Zoltar:      saints  by    2  dog =     vikings    Vegas:     vikings  by    3
Zoltar:     falcons  by    2  dog =      browns    Vegas:      browns  by  1.5
Zoltar:      titans  by    0  dog =       colts    Vegas:       colts  by  3.5
Zoltar:     cowboys  by    6  dog =  commanders    Vegas:     cowboys  by    3
Zoltar:    seahawks  by    0  dog =       lions    Vegas:       lions  by    6
Zoltar:    chargers  by    0  dog =      texans    Vegas:    chargers  by  5.5
Zoltar:       bears  by    0  dog =      giants    Vegas:      giants  by  3.5
Zoltar:      eagles  by    7  dog =     jaguars    Vegas:      eagles  by  6.5
Zoltar:    steelers  by    6  dog =        jets    Vegas:    steelers  by  3.5
Zoltar:       bills  by    0  dog =      ravens    Vegas:       bills  by  3.5
Zoltar:   cardinals  by    2  dog =    panthers    Vegas:    panthers  by  1.5
Zoltar:     packers  by    6  dog =    patriots    Vegas:     packers  by 10.5
Zoltar:     raiders  by    6  dog =     broncos    Vegas:     raiders  by  1.5
Zoltar:  buccaneers  by    5  dog =      chiefs    Vegas:      chiefs  by  2.5
Zoltar:        rams  by    0  dog = fortyniners    Vegas: fortyniners  by  2.5

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I use 3.0 points difference but for the first few weeks of the season I am a bit more conservative and use 4.0 points difference as the advice threshold criterion.

At the beginning of the season, because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is very strongly biased towards Vegas underdogs. I probably need to fix this. For week #4 Zoltar likes five Vegas underdogs and one Vegas favorite:

1. Zoltar likes Vegas underdog Saints against the Vikings.
2. Zoltar likes Vegas underdog Seahawks against the Lions.
3. Zoltar likes Vegas underdog Texans against the Chargers.
4. Zoltar likes Vegas underdog Patriots against the Packers.
5. Zoltar likes Vegas favorite Raiders over the Broncos.
6. Zoltar likes Vegas underdog Buccaneers against the Chiefs.

For example, a bet on the underdog Saints against the Vikings will pay off if the Saints win by any score, or if the favored Vikings win but by less than 3.0 points (in other words, by 2 points or less). If the favored Vikings win by exactly 3 points, the wager is a push.
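The threshold rule can be sketched in code and checked against the week #4 table. This is my reconstruction of the rule as described, not Zoltar's actual code. The function name, the tuple layout, and the choice to treat a difference of exactly 4.0 as advice are assumptions.

```python
def advice(z_fav, z_margin, dog, v_fav, v_margin, threshold=4.0):
  # Zoltar's margin from the Vegas favorite's point of view
  z_signed = z_margin if z_fav == v_fav else -z_margin
  diff = v_margin - z_signed
  v_dog = dog if v_fav == z_fav else z_fav   # the Vegas underdog
  if diff >= threshold:
    return "underdog " + v_dog   # Vegas favorite looks over-rated
  if -diff >= threshold:
    return "favorite " + v_fav   # Vegas favorite looks under-rated
  return None                    # no advice

games = [  # (Zoltar fav, Zoltar margin, dog, Vegas fav, Vegas margin)
  ("bengals", 4, "dolphins", "bengals", 3.5),
  ("saints", 2, "vikings", "vikings", 3),
  ("falcons", 2, "browns", "browns", 1.5),
  ("titans", 0, "colts", "colts", 3.5),
  ("cowboys", 6, "commanders", "cowboys", 3),
  ("seahawks", 0, "lions", "lions", 6),
  ("chargers", 0, "texans", "chargers", 5.5),
  ("bears", 0, "giants", "giants", 3.5),
  ("eagles", 7, "jaguars", "eagles", 6.5),
  ("steelers", 6, "jets", "steelers", 3.5),
  ("bills", 0, "ravens", "bills", 3.5),
  ("cardinals", 2, "panthers", "panthers", 1.5),
  ("packers", 6, "patriots", "packers", 10.5),
  ("raiders", 6, "broncos", "raiders", 1.5),
  ("buccaneers", 5, "chiefs", "chiefs", 2.5),
  ("rams", 0, "fortyniners", "fortyniners", 2.5),
]

picks = [advice(*g) for g in games]
picks = [p for p in picks if p is not None]
for p in picks:
  print(p)
```

With the 4.0-point threshold, the sketch produces the same six pieces of advice listed above: five underdogs and one favorite.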

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

In week #3, against the Vegas point spread, Zoltar went a weak (but somewhat unlucky) 3-3 (using 4.0 points as the advice threshold). Zoltar was on the wrong side of a bad beat in the Steelers vs. Browns game. Zoltar suggested a hypothetical wager on the underdog Steelers, with the Browns favored by 6.0 points. On the last play of the game, the Steelers were losing 23-17, so the hypothetical wager would have been a push. But the Steelers tried a crazy play, which completely backfired, and the Browns scored on a fumble recovery to win 29-17, so all wagers on the Steelers lost. Arg. Things like this make watching games exciting right up until the end.

For the season, Zoltar is 11-6 (64.7% accuracy).

Just for fun, I track how well Zoltar does when trying to predict only which team will win a game. This isn’t useful except for parlay betting. In week #3, just predicting the winning team, Zoltar went 10-6, which is OK but not great. Vegas was not so good at predicting winners in week #3, going just 7-9.

Zoltar sometimes predicts a 0-point margin of victory. There are six such games in week #4. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.



Left: Electric football was invented in the late 1940s. This is a 1960s era version. The game is still very popular today. Center: Strat-O-Matic football was introduced in 1968 and was intended for teen boys as well as adults. The statistics of the game fascinated me when I was young and probably influenced my love of mathematics and computer science. Right: The 3M Pro Football game was introduced in 1966 but is no longer manufactured. I’ve played it a few times and enjoyed it a lot.


Posted in Zoltar

“Multi-Class Classification Using New PyTorch Best Practices, Part 2: Training, Accuracy, Predictions” in Visual Studio Magazine

I wrote an article titled “Multi-Class Classification Using New PyTorch Best Practices, Part 2: Training, Accuracy, Predictions” in the September 2022 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2022/09/12/multi-class-pytorch-2.aspx.

The article is the second in a two-part series that explains how to create a PyTorch multi-class classifier system. The article demo program predicts the political leaning (conservative, moderate, liberal) of a person. The first article in the series explained how to prepare the training and test data, and how to define the neural network classifier. The second article explains how to train the network, compute the accuracy of the trained network, use the network to make predictions, and save the network for use by other programs.

The demo begins by loading a 200-item file of training data and a 40-item set of test data. Each tab-delimited line represents a person. The fields are sex, age, state of residence (Michigan, Nebraska or Oklahoma), annual income and politics type (0 = conservative, 1 = moderate, 2 = liberal). The goal is to predict politics type from sex, age, state and income.

After 1,000 training epochs, the demo program computes the accuracy of the trained model on the training data as 81.50 percent (163 out of 200 correct). The model accuracy on the test data is 75.00 percent (30 out of 40 correct).

After evaluating the trained network, the demo predicts the politics type for a person who is male, 30 years old, from Oklahoma and who makes $50,000 annually. The prediction is [0.6905, 0.3049, 0.0047]. These values are pseudo-probabilities. The largest value (0.6905) is at index [0] so the prediction is class 0 = conservative.
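Mapping the pseudo-probabilities to a predicted class is just a matter of finding the index of the largest value. A tiny sketch in plain Python (the variable names are illustrative):

```python
probs = [0.6905, 0.3049, 0.0047]   # prediction from the demo
labels = ["conservative", "moderate", "liberal"]

# index of the largest pseudo-probability
idx = max(range(len(probs)), key=lambda i: probs[i])
print(idx, labels[idx])   # 0 conservative
```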

The demo concludes by saving the trained model to file so that it can be used without having to retrain the network from scratch. There are two different ways to save a PyTorch model. The demo uses the save-state approach.

The three basic types of PyTorch systems are multi-class classification, binary classification, and regression. Multi-class classification is used when the variable to predict has three or more possible values. When the variable to predict has just two possible values, the problem is called binary classification. Binary classification uses techniques that are different from multi-class classification. Regression problems are ones where the goal is to predict a single numeric value, such as annual income.



Gender and politics.


Posted in PyTorch