PyTorch Transformer Sequence-to-Sequence: Good Examples are Hard to Find

I’ve been looking at deep neural Transformer Architecture (TA) systems for several months. In terms of conceptual ideas and engineering details, they are probably the most complex software systems I’ve ever worked with.


Update: A few weeks after I wrote this blog post, I created my own example of a sequence-to-sequence problem. See:

jamesmccaffrey.wordpress.com/2022/09/09/simplest-transformer-seq-to-seq-example/

and

jamesmccaffrey.wordpress.com/2022/09/12/using-the-simplest-possible-transformer-sequence-to-sequence-example/


Everyone I know, including me, learns ML in the same way: 1.) find an example program, 2.) get it to run, 3.) add print() statements and make changes to figure out exactly how the example program works, 4.) gradually add new code/ideas.



I found this transformer seq-to-seq example on the Internet. It has several flaws.


So, it all starts with finding a working example program.

As far as I’ve been able to determine, there are no really good example programs that demonstrate PyTorch TA sequence-to-sequence on the Internet. I spent hours dissecting one of the main examples returned by a Google search. It is a blog post written by a student. I found roughly a dozen issues with the example, most minor but some significant.

The data for the example was generated programmatically rather than being some monstrously huge English-to-German NLP data. This is good, but the data made no sense to me. For example, this code statement in the data-generation function:

start = np.random.randint(0, 1)

The randint(a,b) function returns a random integer greater-or-equal-to a and strictly less-than b. So the statement always returns 0.
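A quick sanity check confirms the bug, along with the likely intended fix (assuming the author wanted a random alternating-pattern start of 0 or 1):

```python
import numpy as np

# np.random.randint(low, high) samples from [low, high),
# so randint(0, 1) can only ever return 0
samples_bug = [np.random.randint(0, 1) for _ in range(100)]
print(set(samples_bug))  # {0}

# the likely intent -- a random start index of 0 or 1 -- needs high=2
samples_fix = [np.random.randint(0, 2) for _ in range(100)]
print(set(samples_fix))
```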

A few hours into the examination I displayed the values of a predicted output and the expected output just before they were passed to the loss function during training:

print("pred shape: ")
print(pred.shape)
print("y_expected shape: ")
print(y_expected.shape)
input()
loss = loss_fn(pred, y_expected)

The shapes were:

pred shape:
torch.Size([2, 4, 9])
y_expected shape:
torch.Size([2, 9])

Different shapes. This is a bit confusing but in fact correct, because the CrossEntropyLoss function expects a model output with shape [batch_size, nb_classes, *additional_dims] and a target with shape [batch_size, *additional_dims] containing class indices in the range [0, nb_classes-1].
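A tiny stand-alone check of that shape convention, using the sizes printed above (batch_size = 2, nb_classes = 4, sequence length 9; the values are random placeholders, not the example's data):

```python
import torch as T

loss_fn = T.nn.CrossEntropyLoss()
pred = T.randn(2, 4, 9)               # [batch_size, nb_classes, seq_len]
y_expected = T.randint(0, 4, (2, 9))  # [batch_size, seq_len], IDs in [0, 3]
loss = loss_fn(pred, y_expected)      # accepted: result is a scalar tensor
print(loss.shape)  # torch.Size([])
```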

Interestingly, I think I learned more by dissecting the glitches in the example program than I would have if the program had been correct.

The point of this blog post is that transformer architecture sequence-to-sequence systems are incredibly complicated. But I’m confident I will figure them out eventually.



Like the TA seq-to-seq examples I found on the Internet, these two sci-fi movies were disappointing to me because they could have been so much better.

Left: “John Carter” (2012) is based on my favorite sci-fi novel of all time, “A Princess of Mars” (1912) by Edgar Rice Burroughs. Two terrible choices for the lead actor and actress. Poor story line and editing.

Right: “Valerian and the City of a Thousand Planets” (2017) was a follow-up in some sense to one of my favorite films of all time, “The Fifth Element” (1997) by director Luc Besson. Two even worse choices for lead actor and actress: a hero who looks like a 15-year-old girl, and a heroine who was whiny and obnoxious. Ugh. Both movies could have been great instead of merely OK.


Some code I pulled from the example program I was examining. It has many flaws.

# experiment.py
# examine code from a blog post

import numpy as np
import torch as T

device = T.device('cpu') 

# --------------------------------------------------------

def generate_random_data(n):
  SOS_token = np.array([2])  # array with single value 2.0
  EOS_token = np.array([3])
  length = 8

  data = []

  # 1,1,1,1,1,1 -> 1,1,1,1,1  # what?
  for i in range(n // 3):
    X = np.concatenate((SOS_token, np.ones(length),
      EOS_token))
    y = np.concatenate((SOS_token, np.ones(length),
      EOS_token))
    data.append([X, y])

  # 0,0,0,0 -> 0,0,0,0
  for i in range(n // 3):
    X = np.concatenate((SOS_token, np.zeros(length),
      EOS_token))
    y = np.concatenate((SOS_token, np.zeros(length),
      EOS_token))
    data.append([X, y])

  # 1,0,1,0 -> 1,0,1,0,1  # what??
  for i in range(n // 3):
    X = np.zeros(length)
    start = np.random.randint(0, 1)  # WTF? always 0

    X[start::2] = 1

    y = np.zeros(length)
    if X[-1] == 0:
      y[::2] = 1
    else:
      y[1::2] = 1

    X = np.concatenate((SOS_token, X, EOS_token))
    y = np.concatenate((SOS_token, y, EOS_token))
    data.append([X, y])

  np.random.shuffle(data)

  return data  # a list of lists of arrays!!

# --------------------------------------------------------

def batchify_data(data, batch_size=3, padding=False,
  padding_token=-1):
  batches = []
  for idx in range(0, len(data), batch_size):
    # We make sure we dont get the last bit if its
    # not batch_size size
    if idx + batch_size < len(data):
      if padding:
        max_batch_length = 0
        # Get longest sequence in the batch
        for seq in data[idx : idx + batch_size]:
          if len(seq) > max_batch_length:
            max_batch_length = len(seq)

        # Append padding tokens until max length
        for seq_idx in range(batch_size):
          remaining_length = max_batch_length - \
            len(data[idx + seq_idx])
          data[idx + seq_idx] += [padding_token] * \
            remaining_length

      batches.append(np.array(data[idx : idx + \
        batch_size]).astype(np.int64))

  print(f"{len(batches)} batches of size {batch_size}")

  return batches

# --------------------------------------------------------

def get_tgt_mask_static(size) -> T.Tensor:
  # original version was a model method !?
  # Generates a square matrix where each row
  # allows one more word to be seen
  mask = T.tril(T.ones(size, size) == 1) # Lower triangular
  mask = mask.float()
  mask = mask.masked_fill(mask == 0,
    float('-inf')) # Convert zeros to -inf
  mask = mask.masked_fill(mask == 1,
    float(0.0)) # Convert ones to 0

  # EX for size=5:
  # [[0., -inf, -inf, -inf, -inf],
  #  [0.,   0., -inf, -inf, -inf],
  #  [0.,   0.,   0., -inf, -inf],
  #  [0.,   0.,   0.,   0., -inf],
  #  [0.,   0.,   0.,   0.,   0.]]

  return mask

# --------------------------------------------------------

print("\nBegin demo \n")

np.random.seed(1)
T.manual_seed(1)

train_data = generate_random_data(90)
# print(train_data[0])
# input()
# [array([2., 1., 0., 1., 0., 1., 0., 1., 0., 3.]),
#  array([2., 1., 0., 1., 0., 1., 0., 1., 0., 3.])]

# print(train_data[0][0])  # first data item X
# input()
# [2., 1., 0., 1., 0., 1., 0., 1., 0., 3.]

# a list containing two arrays: X, y

# print(train_data)

train_dataloader = batchify_data(train_data)

# --------------------------------------------------------

for batch in train_dataloader:
  print("----------")
  X, y = batch[:, 0], batch[:, 1]
  X, y = T.tensor(X).to(device), T.tensor(y).to(device)
  print("X: ")
  print(X)  # has SOS and EOS
  input()
  print("y: ")
  print(y)  # identical to X !?
  input()

  y_input = y[:,:-1]  # SOS at front but no EOS at end
  y_expected = y[:,1:]  # no SOS at front, but EOS at end
  print("y_input: ")
  print(y_input)
  input()
  print("y_expected: ")
  print(y_expected)
  input()

  sequence_length = y_input.size(1)
  tgt_mask = get_tgt_mask_static(sequence_length).to(device)

  print("seq len: ")
  print(sequence_length)  # 9 -- inc. SOS
  input()
  print("tgt_mask: ")
  print(tgt_mask)
  input()

  print("----------")

# --------------------------------------------------------

print("\nEnd demo \n")
Posted in PyTorch, Transformers

ANOVA Using C# in Visual Studio Magazine

I wrote an article titled “ANOVA Using C#” in the August 2022 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2022/08/17/anova-csharp.aspx.

Analysis of variance (ANOVA) is a classical statistics technique that’s used to infer if the means (averages) of three or more groups are all equal, based on samples from the groups. For example, suppose there are three different introductory computer science classes at a university. Each class is taught by the same teacher but uses a different textbook. You want to know if student performance is the same in all three classes or not.

My article walks through an example where:

Group 1: 3, 4, 6, 5
Group 2: 8, 12, 9, 11, 10, 8
Group 3: 13, 9, 11, 8, 12
Mean 1: (3 + 4 + 6 + 5) / 4 = 18 / 4 = 4.50
Mean 2: (8 + 12 + . . + 8) / 6 = 58 / 6 = 9.67
Mean 3: (13 + 9 + . . + 12) / 5 = 53 / 5 = 10.60
Overall: (3 + 4 + . . + 12) / 15 = 129 / 15 = 8.60

Group means and an overall mean are computed. The means are used to compute SSb (sum of squares between groups) and SSw (sum of squares within groups) values. The SSb and SSw values are used to compute MSb and MSw (mean square) values. The MSb and MSw values are used to compute an F-statistic. The F-statistic is used to compute a calculated p-value = 0.0004.
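The arithmetic above can be verified with a short pure-Python script (computing the p-value needs an F-distribution CDF, so only the F-statistic is checked here):

```python
groups = [[3, 4, 6, 5],
          [8, 12, 9, 11, 10, 8],
          [13, 9, 11, 8, 12]]

k = len(groups)                  # number of groups = 3
N = sum(len(g) for g in groups)  # total sample size = 15
means = [sum(g) / len(g) for g in groups]  # 4.50, 9.67, 10.60
overall = sum(sum(g) for g in groups) / N  # 8.60

# between-group and within-group sums of squares
SSb = sum(len(g) * (m - overall) ** 2 for g, m in zip(groups, means))
SSw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

MSb = SSb / (k - 1)  # between-group mean square
MSw = SSw / (N - k)  # within-group mean square
F = MSb / MSw
print("F = %0.2f" % F)  # F = 15.88
```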

Loosely speaking, the p-value is the likelihood that all three means are the same. Because the p-value is so small, the conclusion is that the means are not all the same. Looking at the data, it appears that the mean of Group 1 is smaller than the means of Group 2 and Group 3.

The results of an ANOVA analysis are probabilistic and should be interpreted conservatively. For real-world data, the computed p-value is only an indication of the likelihood that the source population means are all the same. For small p-values (where “small” depends on your particular problem scenario), an appropriate conclusion is something like, “the sample data suggest that it’s unlikely that the population means are all the same.” For large p-values, an appropriate conclusion is, “the sample data suggest that all k source populations likely have the same mean.”

One significant weakness of ANOVA is that it’s often impossible to test the assumptions that the data sources are Gaussian distributed and have equal variances.



Analysis of Variance is based on the variability of sample data. One of my favorite book series is the Mars series by Edgar Rice Burroughs. Here are three covers of the third book in the series, “The Warlord of Mars” (1914), that have high visual variability. Left: By artist Robert Abbett. Center: By artist Michael Whelan. Right: By artist Gino D’Achille.


Posted in Machine Learning

A Custom Embedding Layer for Numeric Input for PyTorch

Transformer architecture (TA) neural networks were designed for natural language processing (NLP). I’ve been exploring the idea of applying TA to tabular data. The problem is that in NLP all inputs are integers that represent words/tokens. For example, an input of “I think therefore I am” is mapped to integer tokens something like [19, 47, 132, 19, 27]. Then the integer tokens are converted to an embedding vector. For example 19 = [0.1234, -1.0987, 0.3579, 1.1333] where the number of values (4 here) is a hyperparameter called the embedding dim. The embedding values are determined during training.
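For reference, the NLP case uses the built-in torch.nn.Embedding layer. A minimal sketch using the token IDs and embedding dim of 4 from the example above (the weight values are randomly initialized, not the ones shown):

```python
import torch as T

embed = T.nn.Embedding(num_embeddings=200, embedding_dim=4)
tokens = T.tensor([19, 47, 132, 19, 27])  # "I think therefore I am"
vectors = embed(tokens)                   # shape [5, 4]

# both occurrences of token 19 map to the same embedding vector
print(vectors.shape)
print(T.equal(vectors[0], vectors[3]))  # True
```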



Demo of a custom embedding layer for numeric input data


Now suppose that instead of dealing with NLP input, you are dealing with numeric input such as a person’s normalized age of 0.31 and normalized annual income of 0.7850. Because the inputs are not integers, you can’t use the PyTorch built-in torch.nn.Embedding layer to create embedding vectors. I wondered if it would be possible to create a custom embedding layer that converts numeric input into embedding vectors.

After some experimentation I managed to create an example of a custom PyTorch embedding layer for numeric input data.



When I design complex neural architectures I often use pen and paper. Here’s the paper I used while designing the code presented in this blog post. The paw in the lower right is a canine visitor named “Llama” who was helping me.


I used the Iris dataset. It has four numeric input values: sepal length, sepal width, petal length, petal width. The goal is to classify an iris flower as one of three species: setosa (0), versicolor (1), or virginica (2). Each input is converted to an embedding vector with 2 values.

Note: Conceptually, a word embedding creates vectors where similar words (“boy” and “man”) are mathematically close together. For numeric input, an embedding doesn’t do that. The idea is to create a layer that isn’t fully connected and therefore the input values don’t have a direct relationship with each other. The ideas are pretty deep.

My experiment was hard-coded specifically for the Iris dataset and is just a proof of concept. The idea is to create a separate weight matrix for each of the four input values. Each of the four inputs generates a temp result matrix, and then the four temp matrices are combined into the final result.

The key network definition code looks like

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()  # explicit old-style syntax
    self.embed = NumericEmbedLayer(4, 2)  # 4-8
    self.hid1 = T.nn.Linear(8, 10)        # 8-10
    self.oupt = T.nn.Linear(10, 3)        # 10-3
    
  def forward(self, x):       # x is [bs, 4]
    z = self.embed(x)         # z is [bs, 8]
    z = T.tanh(self.hid1(z))  # z is [bs, 10]
    z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss() 
    return z                  # z is [bs, 3]

The 4 inputs are fed to the custom NumericEmbedLayer which produces 8 values. Those 8 values go to a hidden layer which outputs 10 values. The final output layer maps the 10 values to 3 values.

The experiment was a lot more difficult than I thought it'd be. Creating a general purpose embedding layer for arbitrary numeric input would require a significant effort. Maybe I'll get around to it some day.



Three nice images from a search for “embedded portrait”. Left: By artist Hans Jochem Bakker. Center: By artist Christopher Kennedy. Right: By artist Daniel Arrhakis.


Demo code.

# iris_embedding.py
# PyTorch 1.10.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11 

# experiment with embedding for numeric data

import numpy as np
import torch as T
device = T.device('cpu')  # apply to Tensor or Module

# -----------------------------------------------------------

class NumericEmbedLayer(T.nn.Module):
  def __init__(self, n_in, embed_dim):  # n_in = 4 for Iris
    super().__init__()  # shortcut syntax

    # hard-coded for Iris dataset - not a general soln

    # one weight matrix per feature
    self.weights_0 = T.nn.Parameter(T.zeros((embed_dim, 1),
      dtype=T.float32))
    self.weights_1 = T.nn.Parameter(T.zeros((embed_dim, 1),
      dtype=T.float32))
    self.weights_2 = T.nn.Parameter(T.zeros((embed_dim, 1),
      dtype=T.float32))
    self.weights_3 = T.nn.Parameter(T.zeros((embed_dim, 1),
      dtype=T.float32))
    # no biases

    T.nn.init.uniform_(self.weights_0, -0.10, 0.10)
    T.nn.init.uniform_(self.weights_1, -0.10, 0.10)
    T.nn.init.uniform_(self.weights_2, -0.10, 0.10)
    T.nn.init.uniform_(self.weights_3, -0.10, 0.10)

  def forward(self, x):
    col_0 = x[:,0:1]  # fetch each input column
    col_1 = x[:,1:2]
    col_2 = x[:,2:3]
    col_3 = x[:,3:4]

    # create the embeddings
    tmp_0 = T.mm(col_0, self.weights_0.t())  # [bs, 2]
    tmp_1 = T.mm(col_1, self.weights_1.t())
    tmp_2 = T.mm(col_2, self.weights_2.t())
    tmp_3 = T.mm(col_3, self.weights_3.t())

    # combine
    res = T.hstack((tmp_0, tmp_1, tmp_2, tmp_3))  # [bs, 8]
    return res

# -----------------------------------------------------------

class IrisDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None):
    # 5.0, 3.5, 1.3, 0.3, 0
    tmp_x = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,4), delimiter=",", comments="#",
      dtype=np.float32)
    tmp_y = np.loadtxt(src_file, max_rows=num_rows,
      usecols=4, delimiter=",", comments="#",
      dtype=np.int64)

    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.int64).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    spcs = self.y_data[idx] 
    sample = { 'predictors' : preds, 'species' : spcs }
    return sample  # as Dictionary

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()  # explicit old-style syntax
    # super().__init__()  # shortcut syntax, Python 3
    self.embed = NumericEmbedLayer(4, 2)  # 4-8
    self.hid1 = T.nn.Linear(8, 10)        # 8-10
    self.oupt = T.nn.Linear(10, 3)        # 10-3
    
    # override default initialization
    lo = -0.10; hi = +0.10
    T.nn.init.uniform_(self.hid1.weight, lo, hi)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.uniform_(self.oupt.weight, lo, hi)
    T.nn.init.zeros_(self.oupt.bias)
    
  def forward(self, x):       # x is [bs, 4]
    z = self.embed(x)         # z is [bs, 8]
    z = T.tanh(self.hid1(z))  # z is [bs, 10]
    z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss() 
    return z                  # z is [bs, 3]

# -----------------------------------------------------------

def accuracy(model, dataset):
  # assumes model.eval()
  dataldr = T.utils.data.DataLoader(dataset, batch_size=1,
    shuffle=False)
  n_correct = 0; n_wrong = 0
  for (_, batch) in enumerate(dataldr):
    X = batch['predictors'] 
    Y = batch['species']  # already 1D shaped by Dataset
    with T.no_grad():
      oupt = model(X)  # logits form

    big_idx = T.argmax(oupt)
    # if big_idx.item() == Y.item():
    if big_idx == Y:
      n_correct += 1
    else:
      n_wrong += 1

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin Iris numeric embedding experiment ")
  T.manual_seed(1)
  np.random.seed(1)
  
  # 1. create DataLoader objects
  print("\nCreating Iris train and test Datasets ")

  train_file = ".\\Data\\iris_train.txt"  
  test_file = ".\\Data\\iris_test.txt"  

  train_ds = IrisDataset(train_file)  # 120 items
  test_ds = IrisDataset(test_file)    # 30 

  bat_size = 6
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)
  
# -----------------------------------------------------------

  # 2. create network
  print("\nCreating 4-(8)-10-3 neural network ")
  net = Net().to(device)

  # 3. train model
  max_epochs = 500
  ep_log_interval = 50
  lrn_rate = 0.01

  loss_func = T.nn.NLLLoss()  # assumes log_softmax()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_size)
  print("loss = " + str(loss_func))
  print("optimizer = SGD")
  print("max_epochs = %3d " % max_epochs)
  print("lrn_rate = %0.3f " % lrn_rate)

  print("\nStarting training")
  net.train()
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch['predictors']  # [6,4]
      Y = batch['species']  # OK; already flattened
      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()  # compute gradients
      optimizer.step()     # update weights and biases

    if epoch % ep_log_interval == 0:
      print("epoch = %4d  |  loss = %8.4f  | " % \
        (epoch, epoch_loss), end="")
      net.eval()
      train_acc = accuracy(net, train_ds)
      print(" acc = %8.4f " % train_acc)
      net.train()
  print("Done ")

# -----------------------------------------------------------

  # 4. evaluate model accuracy
  print("\nComputing model accuracy")
  net.eval()
  acc = accuracy(net, test_ds)  # item-by-item
  print("Accuracy on test data = %0.4f" % acc)
  
  # 5. make a prediction
  print("\nPredicting species for [6.1, 3.1, 5.1, 1.1]: ")
  x = np.array([[6.1, 3.1, 5.1, 1.1]], dtype=np.float32)
  x = T.tensor(x, dtype=T.float32).to(device) 

  with T.no_grad():
    logits = net(x)      # as log_softmax
  probs = T.exp(logits)    # pseudo-probs
  T.set_printoptions(precision=4)
  print(probs)

# -----------------------------------------------------------

  # 6. save model (state_dict approach)
  print("\nSaving trained model state")
  fn = ".\\Models\\iris_model.pt"
  T.save(net.state_dict(), fn)

  # saved_model = Net()
  # saved_model.load_state_dict(T.load(fn))
  # use saved_model to make prediction(s)

  print("\nEnd numeric embedding experiment ")

if __name__ == "__main__":
  main()
Posted in PyTorch

Converting Fashion-MNIST Binary Files to Text Files

The MNIST (Modified National Institute of Standards and Technology) dataset contains images of handwritten digits from ‘0’ to ‘9’. Each image is 28 by 28 pixels and each pixel is a grayscale value between 0 and 255.

The MNIST data is stored in a custom binary format that’s not directly usable. Furthermore, the pixel values and class label values are stored in two different files. There are 60,000 MNIST training images and 10,000 test images.

I always convert MNIST from their raw binary format into train and test text files. I usually put each image on a line where the first 784 values are the pixels and the last value is the digit/label. See jamesmccaffrey.wordpress.com/2022/01/21/working-with-mnist-data/.

The main problem with the MNIST dataset is that it’s too easy to create a good classifier. You can easily get a model with well over 99% accuracy.

The Fashion-MNIST was designed to be a drop-in replacement for MNIST. Fashion-MNIST is identical to MNIST except that each image is one of ten pieces of clothing:

0   T-shirt/top
1   Trouser
2   Pullover
3   Dress
4   Coat
5   Sandal
6   Shirt
7   Sneaker
8   Bag
9   Ankle boot

Fashion-MNIST is at github.com/zalandoresearch/fashion-mnist. I decided to see if my utility program that converts MNIST binary to text could be adapted to do the same for Fashion-MNIST. Bottom line: yes, creating Fashion-MNIST text files is almost exactly the same as creating MNIST text files.

Briefly:

1. manually download four gzipped-binary files from
   github.com/zalandoresearch/fashion-mnist/tree
   /master/data/fashion 
2. use 7-Zip to unzip files, add ".bin" extension
3. determine format you want and modify script
4. run the script

The script works like this:

open pixels binary file for reading
open labels binary file for reading
open destination file for writing
read and discard file header info

set number images wanted
loop number images times
  get label value from label file
  convert label from binary to text
  loop 784 times
    get pixel value from pixels file
    convert from binary to text
    write pixel value to file
  end-loop
  write label value
  write newline
end-loop

I intend to use the Fashion-MNIST dataset to do some experiments with warm-start training: train an MNIST classifier model from scratch, put the MNIST model weights into an empty Fashion-MNIST classifier, train the Fashion-MNIST model (warm-start), and see if the resulting classifier is better than one trained from scratch.
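Because MNIST and Fashion-MNIST items have identical shapes, the warm-start mechanics are just a state_dict copy. A minimal sketch (the Net class here is a stand-in 784-100-10 classifier, not an actual trained model):

```python
import torch as T

class Net(T.nn.Module):
  # stand-in classifier; for warm-start the MNIST and
  # Fashion-MNIST models just need identical architectures
  def __init__(self):
    super().__init__()
    self.hid = T.nn.Linear(784, 100)
    self.oupt = T.nn.Linear(100, 10)
  def forward(self, x):
    return T.log_softmax(self.oupt(T.tanh(self.hid(x))), dim=1)

mnist_net = Net()    # pretend this was trained on MNIST

fashion_net = Net()  # untrained Fashion-MNIST classifier
fashion_net.load_state_dict(mnist_net.state_dict())  # warm start
# ...now train fashion_net on Fashion-MNIST as usual
```

With a model saved to disk, the copy step would instead be fashion_net.load_state_dict(T.load(fn)) where fn is the saved MNIST state_dict file.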



There is a long and fascinating history of computer science related to chess. There is not a long history of fashion related to computer science. Here are three examples of fashion inspired by chess.


Code.

# converter_f-mnist.py
# Anaconda3-2020.02 - Python 3.7.6

import numpy as np
import matplotlib.pyplot as plt

# convert Fashion MNIST binary to text file; 
# combine pixels and labels
# target format:
# pixel_1 (tab) pixel_2 (tab) . . pixel_784 (tab) digit

# 0 = T-shirt/top, 1 = Trouser, 2 = Pullover
# 3 = Dress, 4 = Coat, 5 = Sandal, 6 = Shirt
# 7 = Sneaker, 8 = Bag, 9 = Ankle boot.

# 1. manually download four gzipped-binary files from
# github.com/zalandoresearch/fashion-mnist/tree
#   /master/data/fashion 
# 2. use 7-Zip to unzip files, add ".bin" extension
# 3. determine format you want and modify script

def convert(img_file, label_file, txt_file, n_images):
  print("\nOpening binary pixels and labels files ")
  lbl_f = open(label_file, "rb")   # F-MNIST has labels
  img_f = open(img_file, "rb")     # and pixel vals separate
  print("Opening destination text file ")
  txt_f = open(txt_file, "w")      # output file to write to

  print("Discarding binary pixel and label files headers ")
  img_f.read(16)   # discard header info
  lbl_f.read(8)    # discard header info

  print("\nReading binary files, writing to text file ")
  print("Format: 784 pixel vals then label val, tab delimited ")
  for i in range(n_images):   # number images requested 
    lbl = ord(lbl_f.read(1))  # get label (unicode, one byte) 
    for j in range(784):  # get 784 vals from the image file
      val = ord(img_f.read(1))
      txt_f.write(str(val) + "\t") 
    txt_f.write(str(lbl) + "\n")
  img_f.close(); txt_f.close(); lbl_f.close()
  print("\nDone ")

def display_from_file(txt_file, idx):
  all_data = np.loadtxt(txt_file, delimiter="\t",
    usecols=range(0,785), dtype=np.int64)

  x_data = all_data[:,0:784]  # all rows, 784 cols
  y_data = all_data[:,784]    # all rows, last col

  label = y_data[idx]
  print("label = ", str(label), "\n")

  pixels = x_data[idx]
  pixels = pixels.reshape((28,28))
  for i in range(28):
    for j in range(28):
      # print("%.2X" % pixels[i,j], end="")
      print("%3d" % pixels[i,j], end="")
      print(" ", end="")
    print("")

  plt.tight_layout()
  plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
  plt.show()  

# -----------------------------------------------------------

def main():
  n_images = 1000
  print("\nCreating %d F-MNIST train images from binary files "\
   % n_images)
  convert(".\\UnzippedBinary\\train-images-idx3-ubyte.bin",
          ".\\UnzippedBinary\\train-labels-idx1-ubyte.bin",
          "f-mnist_train_1000.txt", 1000)

  n_images = 100
  print("\nCreating %d F-MNIST test images from binary files " %\
    n_images)
  convert(".\\UnzippedBinary\\t10k-images-idx3-ubyte.bin",
          ".\\UnzippedBinary\\t10k-labels-idx1-ubyte.bin",
          "f-mnist_test_100.txt", 100)

  print("\nShowing train image [0]: ")
  img_file = ".\\f-mnist_train_1000.txt"
  display_from_file(img_file, idx=0)  # first image
  
if __name__ == "__main__":
  main()
Posted in PyTorch

Nob’s Number Puzzle

I ran across a nice mathematical puzzle recently. It’s called “Nob’s Number Puzzle”. Here it is:

The answer is given at the bottom of this post. The puzzle is clever but not some sort of crazy thing: you should be able to find the missing number with a bit of thought.

“Nob” is the nickname of Nobuyuki Yoshigahara (1936–2004), a well-known puzzle creator.

I remember taking tests when I was young that contained questions where the goal is to find the next number in a sequence. For example, 1, 1, 2, 3, 5, 8, 13, 21, ? is the Fibonacci sequence where each number is the sum of the previous two numbers, so the missing number is 13 + 21 = 34.

One of my favorite missing number puzzles is: 1, 4, 9, 7, 7, 9, 4, 1, ?

The sequence is 1^2 = 1, 2^2 = 4, 3^2 = 9, 4^2 = 16 -> 7, 5^2 = 25 -> 7, 6^2 = 36 -> 9, 7^2 = 49 -> 13 -> 4, and so on. You square each integer in order, and if the result is more than one digit, you add the digits (repeating if necessary). So the missing number is 9^2 = 81 -> 9.
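A few lines of Python reproduce the sequence (repeatedly summing digits until one digit remains):

```python
def sum_digits(n):
  # add decimal digits, repeating until a single digit remains
  while n > 9:
    n = sum(int(d) for d in str(n))
  return n

seq = [sum_digits(i * i) for i in range(1, 10)]
print(seq)  # [1, 4, 9, 7, 7, 9, 4, 1, 9]
```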

Knowing this squared-number puzzle was the reason I was able to solve Nob’s number puzzle quickly. That is a hint for Nob’s puzzle.

Years ago, puzzle questions like these were very common in job interviews at places like Microsoft and Google. I never liked puzzle questions for job interviews and I never used them. My go-to question in a job interview is “Tell me about a project or hobby you’ve done in your spare time that you’re proud of.”

For example, if I were asked this question in an interview, I’d tell the interviewer about my Zoltar program that predicts the results of NFL professional football games.

When I ask a job candidate this question in a job interview, one of two things usually happens.

First, a candidate can come up with an answer quickly and get very excited to tell me about their project in great detail. In general, I am impressed with a guy like this. The exact project doesn’t matter as much as the passion I see in him.

Second, a candidate doesn’t come up with a hobby-project or tells me about a school assignment. This isn’t a deal-killer but I’m not impressed and continue the interview with additional questions to see if the candidate has a true love of programming and software development, or if he is just looking for a job.


Answer to Nob’s number puzzle: the missing number is 12. On the top row, 7 + 2 + 9 + 9 = 27. Second row, 2 + 7 + 4 + 5 = 18. And so on. Simple and elegant!

Notice that if the numbers were written as 7 2, 9 9 instead of 72, 99, the puzzle would be much easier to solve.




Three puzzle boxes. The goal is to open the box. Puzzle boxes range from simple and easy to sophisticated and almost impossible to solve.


Posted in Miscellaneous

Transformer Based Anomaly Detection for Tabular Data

I recently explored an anomaly detection system based on a Transformer encoder and reconstruction error. The inputs for my example were the UCI Digits dataset items. Each data item is 64 pixel values between 0 and 16. I used UCI Digits items because they easily mapped to a Transformer encoder.

The idea is that a Transformer model was originally designed for input items that are words, such as “I think therefore I am”. Each word/token is mapped to an integer ID such as [17, 283, 167, 17, 35]. Then each word/token ID is mapped to a word embedding such as 17 = [0.1234, 0.9876, 0.2468, 0.1357]. For UCI Digits data items, each pixel value corresponds to a word/token.

But what about tabular data? For example, suppose you have a dataset of employee information like so:

 1   0.24   1   0   0   0.2950   0   0   1
-1   0.39   0   0   1   0.5120   0   1   0
 1   0.63   0   1   0   0.7580   1   0   0
-1   0.36   1   0   0   0.4450   0   1   0
. . .

The columns represent sex (male = -1, female = +1), age (divided by 100), city (one-hot encoded), income (divided by 100,000) and job type (one-hot encoded).

I coded up a demo. Because the employee data aren’t integers, I skipped the standard embedding layer and replaced it with a Linear layer. I was mildly surprised when the demo system seemed to work quite well.
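The scoring side of reconstruction-error anomaly detection reduces to a few lines: reconstruct each item and measure how badly the reconstruction differs from the input. A minimal sketch, not the demo's exact code (an identity model stands in for a trained network so the snippet runs on its own):

```python
import torch as T

def reconstruction_error(model, x):
  # per-item squared error between input and reconstruction;
  # a larger score suggests a more anomalous item
  model.eval()
  with T.no_grad():
    y = model(x)
  return T.sum((x - y) ** 2, dim=1)

# an identity "model" reconstructs perfectly, so every score is 0.0;
# a trained reconstructing network would go here instead
model = T.nn.Identity()
x = T.randn(4, 9)  # 4 items, 9 columns like the Employee data
scores = reconstruction_error(model, x)
print(scores)  # tensor([0., 0., 0., 0.])
```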

Very interesting.



In real life, man-eating plants are anomalous. But in old comic book covers, man-eating plants are common.


Demo code. The Employee data can be found at:

jamesmccaffrey.wordpress.com/2022/05/17/autoencoder-anomaly-detection-using-pytorch-1-10-on-windows-11/

# employee_trans_anomaly.py
# Transformer based reconstruction error anomaly detection
# PyTorch 1.10.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11

import numpy as np
import torch as T

device = T.device('cpu') 
T.set_num_threads(1)

# -----------------------------------------------------------

class EmployeeDataset(T.utils.data.Dataset):
  # sex  age   city     income  job
  # -1   0.27  0  1  0  0.7610  0  0  1
  # +1   0.19  0  0  1  0.6550  0  1  0
  # sex: -1 = male, +1 = female
  # city: anaheim, boulder, concord
  # job: mgmt, supp, tech

  def __init__(self, src_file):
    tmp_x = np.loadtxt(src_file, usecols=range(0,9),
      delimiter="\t", comments="#", dtype=np.float32)
    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx, :]  # row idx, all cols
    sample = { 'predictors' : preds }  # as Dictionary
    return sample  

# -----------------------------------------------------------

class PositionalEncoding(T.nn.Module):  # documentation code
  def __init__(self, d_model: int, dropout: float=0.1,
   max_len: int=5000):
    super(PositionalEncoding, self).__init__()  # old syntax
    self.dropout = T.nn.Dropout(p=dropout)
    pe = T.zeros(max_len, d_model)  # like 10x4
    position = \
      T.arange(0, max_len, dtype=T.float).unsqueeze(1)
    div_term = T.exp(T.arange(0, d_model, 2).float() * \
      (-np.log(10_000.0) / d_model))
    pe[:, 0::2] = T.sin(position * div_term)
    pe[:, 1::2] = T.cos(position * div_term)
    pe = pe.unsqueeze(0).transpose(0, 1)
    self.register_buffer('pe', pe)  # allows state-save

  def forward(self, x):
    # x is batch_first here: [bs, seq_len, d_model]; pe is
    # [max_len, 1, d_model], so slice by seq_len and transpose
    x = x + self.pe[:x.size(1), :].transpose(0, 1)
    return self.dropout(x)

# -----------------------------------------------------------

class Transformer_Net(T.nn.Module):
  def __init__(self):
    # 9 numeric inputs: no exact word embedding equivalent
    # pseudo embed_dim = 4
    # seq_len = 9
    super(Transformer_Net, self).__init__()

    self.fc1 = T.nn.Linear(9, 9*4)  # pseudo-embedding

    self.pos_enc = \
      PositionalEncoding(4, dropout=0.00)  # positional

    self.enc_layer = T.nn.TransformerEncoderLayer(d_model=4,
      nhead=2, dim_feedforward=100, 
      batch_first=True)  # d_model divisible by nhead

    self.trans_enc = T.nn.TransformerEncoder(self.enc_layer,
      num_layers=6)

    self.dec1 = T.nn.Linear(36, 18)
    self.dec2 = T.nn.Linear(18, 9)

    # use default weight initialization

  def forward(self, x):
    # x is Size([bs, 9])
    z = T.tanh(self.fc1(x))   # [bs, 36]
    z = z.reshape(-1, 9, 4)   # [bs, 9, 4] 
    z = self.pos_enc(z)       # [bs, 9, 4]
    z = self.trans_enc(z)     # [bs, 9, 4]

    z = z.reshape(-1, 36)              # [bs, 36]
    z = T.tanh(self.dec1(z))           # [bs, 18]
    z = self.dec2(z)  # no activation  # [bs, 9]
  
    return z

# -----------------------------------------------------------

def analyze_error(model, ds):
  largest_err = 0.0
  worst_x = None
  worst_y = None
  n_features = len(ds[0]['predictors'])

  for i in range(len(ds)):
    X = ds[i]['predictors']
    with T.no_grad():
      Y = model(X)  # should be same as X
    err = T.sum((X-Y)*(X-Y)).item()  # SSE all features
    err = err / n_features           # sort of norm'ed SSE 

    if err > largest_err:
      largest_err = err
      worst_x = X
      worst_y = Y

  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  print("Largest reconstruction error: %0.4f" % largest_err)
  print("Worst data item    = ")
  print(worst_x.numpy())
  print("Its reconstruction = " )
  print(worst_y.numpy())

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin Employee transformer based anomaly detect ")
  T.manual_seed(0)
  np.random.seed(0)
  
  # 1. create DataLoader objects
  print("\nCreating Employee Dataset ")

  data_file = ".\\Data\\employee_all.txt"
  data_ds = EmployeeDataset(data_file)  # 240 rows

  bat_size = 10
  data_ldr = T.utils.data.DataLoader(data_ds,
    batch_size=bat_size, shuffle=True)

  # 2. create network
  print("\nCreating Transformer encoder-decoder network ")
  net = Transformer_Net().to(device)

# -----------------------------------------------------------

  # 3. train autoencoder model
  max_epochs = 100
  ep_log_interval = 10
  lrn_rate = 0.005

  loss_func = T.nn.MSELoss()
  optimizer = T.optim.Adam(net.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_size)
  print("loss = " + str(loss_func))
  print("optimizer = Adam")
  print("lrn_rate = %0.3f " % lrn_rate)
  print("max_epochs = %3d " % max_epochs)
  
  print("\nStarting training")
  net.train()
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch

    for (batch_idx, batch) in enumerate(data_ldr):
      X = batch['predictors'] 
      Y = batch['predictors'] 

      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()
      optimizer.step()

    if epoch % ep_log_interval == 0:
      print("epoch = %4d  |  loss = %0.4f" % \
       (epoch, epoch_loss))
  print("Done ")

# -----------------------------------------------------------

  # 4. find item with largest reconstruction error
  print("\nAnalyzing data for largest reconstruction error \n")
  net.eval()
  analyze_error(net, data_ds)

  print("\nEnd transformer autoencoder anomaly demo ")

if __name__ == "__main__":
  main()
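The sin/cos table built by the PositionalEncoding class above can be sanity-checked in isolation. A numpy sketch (small illustrative dimensions) that verifies pe[pos, 2i] = sin(pos / 10000^(2i/d_model)) and pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model)):

```python
import numpy as np

d_model, max_len = 4, 10
position = np.arange(max_len, dtype=np.float64)[:, None]   # [10, 1]
div_term = np.exp(np.arange(0, d_model, 2) *
                  (-np.log(10_000.0) / d_model))           # [2]

pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(position * div_term)   # even columns
pe[:, 1::2] = np.cos(position * div_term)   # odd columns

# spot-check the closed form at position 3, frequency pair i=1
pos, i = 3, 1
assert np.isclose(pe[pos, 2*i],   np.sin(pos / 10_000.0**(2*i/d_model)))
assert np.isclose(pe[pos, 2*i+1], np.cos(pos / 10_000.0**(2*i/d_model)))
```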
Posted in Transformers

Researchers Demonstrate Transformer Architecture-Based Anomaly Detection for Cybersecurity on the Pure AI Web Site

I contributed to an article titled “Researchers Demonstrate Transformer Architecture-Based Anomaly Detection for Cybersecurity” on the Pure AI web site. See https://pureai.com/articles/2022/08/02/ta-anomaly-detection.aspx.

Researchers at Microsoft have demonstrated a new technique for anomaly detection. The technique is based on deep neural transformer architecture (TA), which was originally intended for natural language processing. However, over the past two years, TA-based systems have been successfully adapted for other problem scenarios.

The screenshot below illustrates how the TA anomaly detection technique works. The system shown successfully identifies one anomalous item that had been placed into a dataset of 100 normal items. The data items are a subset of the UCI Digits dataset. Each item is a crude handwritten digit from “0” to “9.”

The experiment created an anomalous item using a technique called the fast gradient sign method (FGSM) attack.
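The core FGSM step perturbs each input value by a small epsilon in the direction of the sign of the loss gradient with respect to the input, then clips back to the valid pixel range. A numpy sketch with made-up values:

```python
import numpy as np

eps = 0.20                                   # perturbation size
x = np.array([0.00, 0.50, 1.00, 0.25])       # normalized pixels
grad = np.array([0.3, -0.7, 0.1, 0.0])       # dLoss/dInput (made up)

mutated = np.clip(x + eps * np.sign(grad), 0.0, 1.0)
print(mutated)  # values: 0.2, 0.3, 1.0 (clipped), 0.25
```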



TA was originally developed in 2017 and was designed to handle long sequences of words, such as a paragraph of text. TA systems proved to be very successful and quickly replaced earlier systems based on LSTM ("long short-term memory") architecture.

The anomaly detection system scans each item in the source dataset and uses a Transformer component to generate a condensed latent representation of the item. Then, a standard deep neural network decodes the latent representations and expands each item back to a format that is the same as the source data. The detection system compares each item's original value with its reconstructed value. Data items with large reconstruction error don't fit the TA model and are therefore likely to be anomalous in some way.
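The reconstruction-error idea can be sketched in a few lines of numpy. Here X holds made-up original items and Y their reconstructions; the item with the largest per-feature-normalized squared error is flagged:

```python
import numpy as np

# original items (rows) and their reconstructions
X = np.array([[0.1, 0.2, 0.3],
              [0.5, 0.5, 0.5],
              [0.9, 0.1, 0.4]])
Y = np.array([[0.1, 0.2, 0.3],    # near-perfect reconstruction
              [0.5, 0.6, 0.5],    # small error
              [0.1, 0.9, 0.9]])   # large error -> likely anomaly

err = ((X - Y) ** 2).sum(axis=1) / X.shape[1]  # normalized SSE
worst = int(np.argmax(err))
print(worst)  # 2 -- the item that fits the model worst
```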

I am quoted in the article: “We were somewhat surprised at how effective the transformer architecture anomaly detection system was on small dummy datasets. The TA system worked much better than other anomaly detection systems at detecting FGSM attack items.”



During World War II (1940-1945) almost all fighter aircraft had propellers in front. Experimental planes with a pusher configuration were an anomalous design. Here are three U.S. experiments, all introduced in 1943. Although promising, by 1943 it was clear that jet-powered aircraft were the future, so none of these designs were pursued. Left: The Vultee XP-54 "Swoose Goose". Center: The Curtiss-Wright XP-55 "Ascender" (jokingly called the "ass ender"). Right: The Northrop XP-56 "Black Bullet".


Posted in Machine Learning

PyTorch Word Embedding Layer from Scratch

The PyTorch neural library has a torch.nn.Embedding() layer that converts a word integer token to a vector. For example, “the” = 5 might be converted to a vector like [0.1234, -1.1044, 0.9876, 1.0234], assuming the embed_dim = 4. The values of the embedding vector are learned during training.

I tried to look up the source code for the Embedding class but quickly discovered a complex nightmare of dozens of C++ and Python files. So, to test my knowledge, I implemented a PyTorch embedding layer from scratch.



Left: IMDB example using built-in torch.nn.Embedding layer. Right: Same example using a from-scratch MyEmbedding layer.


It was almost too easy because an embedding is just a lookup table where the row index is the word/token ID and the row holds the embedding values. My demo greatly simplifies by leaving out options such as padding_idx and max_norm.

class MyEmbedding(T.nn.Module):
  def __init__(self, vocab_size, embed_dim):
    super(MyEmbedding, self).__init__()
    self.weight = \
      T.nn.Parameter(T.zeros((vocab_size, embed_dim), \
        dtype=T.float32))
    T.nn.init.uniform_(self.weight, -0.10, +0.10)
    # T.nn.init.normal_(self.weight)  # mean = 0, stddev = 1

  def forward(self, x):
    return self.weight[x]

The embedding values are a matrix of trainable weights with size vocab_size by embed_dim. The built-in Embedding layer initializes using Normal with mean = 0, std dev = 1. Just to experiment, for my from-scratch MyEmbedding layer I used Uniform with range (-0.10, +0.10) initialization.
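A quick numpy comparison of the two initialization schemes (illustrative sizes) shows the difference in scale:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 1000, 32

uni = rng.uniform(-0.10, 0.10, (vocab_size, embed_dim))  # MyEmbedding style
nrm = rng.standard_normal((vocab_size, embed_dim))       # built-in style

print(float(uni.std()))  # roughly 0.2 / sqrt(12), about 0.058
print(float(nrm.std()))  # roughly 1.0
```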

To test my custom embedding layer, I grabbed my standard IMDB movie review LSTM example. I ran the example using the built-in torch.nn.Embedding() layer, and then I edited the program to use the custom MyEmbedding() layer. Both versions worked quite well.

I searched the Internet for information about implementing an embedding layer from scratch and found wildly conflicting and contradictory information. Much of it involved one-hot encoding of the input token. My demo seems to make sense to me and to work fine but there’s a chance I could be wrong somehow.
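The one-hot explanations and the lookup-table view are actually the same computation: multiplying a one-hot row vector by the weight matrix simply selects one row. A numpy sketch:

```python
import numpy as np

vocab_size, embed_dim = 6, 4
rng = np.random.default_rng(1)
W = rng.uniform(-0.10, 0.10, (vocab_size, embed_dim))

tokens = np.array([5, 0, 3])          # word/token IDs

# view 1: direct row lookup (what MyEmbedding.forward does)
lookup = W[tokens]                    # [3, 4]

# view 2: one-hot encode each token, then matrix-multiply
onehot = np.eye(vocab_size)[tokens]   # [3, 6]
matmul = onehot @ W                   # [3, 4]

assert np.allclose(lookup, matmul)    # identical results
```

The lookup form just avoids materializing the one-hot vectors.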

I can’t think of any scenarios where it would be useful to implement a custom embedding layer but there might be some such situations. I did this experiment just to explore how embeddings work and increase my understanding.



Everyone has mental context embeddings. You can make all kinds of inferences about these three photos. All three women are about to be embedded in jail but only one committed a serious crime.


Demo code. Getting the IMDB data is a major challenge. See jamesmccaffrey.wordpress.com/2022/01/17/imdb-movie-review-sentiment-analysis-using-an-lstm-with-pytorch/

# imdb_lstm.py
# uses preprocessed data instead of built-in data
# batch_first geometry
# PyTorch 1.10.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11

import numpy as np
import torch as T
device = T.device('cpu')

# -----------------------------------------------------------

class MyEmbedding(T.nn.Module):
  def __init__(self, vocab_size, embed_dim):
    super(MyEmbedding, self).__init__()
    self.weight = \
      T.nn.Parameter(T.zeros((vocab_size, embed_dim), \
        dtype=T.float32))
    T.nn.init.uniform_(self.weight, -0.10, +0.10)
    # T.nn.init.normal_(self.weight)  # mean = 0, stddev = 1

  def forward(self, x):
    return self.weight[x]

# -----------------------------------------------------------

class LSTM_Net(T.nn.Module):
  def __init__(self):
    # vocab_size = 129892
    super(LSTM_Net, self).__init__()
    self.embed = T.nn.Embedding(129892, 32)    # built-in
    # self.embed = MyEmbedding(129892, 32)     # from scratch
    self.lstm = T.nn.LSTM(32, 100, batch_first=True)
    self.do1 = T.nn.Dropout(0.20)
    self.fc1 = T.nn.Linear(100, 1)  # binary
 
  def forward(self, x):
    # x = review/sentence. length = fixed w/ padding (front)
    z = self.embed(x)  # expand each token to 32 values
    z = z.reshape(-1, 50, 32)  # bat seq embed
    lstm_oupt, (h_n, c_n) = self.lstm(z)
    z = lstm_oupt[:,-1]  # [bs,100] output at last time step
    z = self.do1(z)
    z = T.sigmoid(self.fc1(z))  # BCELoss()
    return z 

# -----------------------------------------------------------

class IMDB_Dataset(T.utils.data.Dataset):
  # 50 token IDs then 0 or 1 label, space delimited
  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(0,51),
      delimiter=" ", comments="#", dtype=np.int64)
    tmp_x = all_xy[:,0:50]   # cols [0,50) = [0,49]
    tmp_y = all_xy[:,50]     # all rows, just col 50
    self.x_data = T.tensor(tmp_x, dtype=T.int64).to(device) 
    self.y_data = T.tensor(tmp_y, dtype=T.float32).to(device)
    self.y_data = self.y_data.reshape(-1, 1)  # float32 2D 

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    tokens = self.x_data[idx]
    trgts = self.y_data[idx] 
    return (tokens, trgts)

# -----------------------------------------------------------

def accuracy(model, dataset):
  # data_x and data_y are lists of tensors
  # assumes model.eval()
  num_correct = 0; num_wrong = 0
  ldr = T.utils.data.DataLoader(dataset,
    batch_size=1, shuffle=False)
  for (batch_idx, batch) in enumerate(ldr):
    X = batch[0]  # inputs
    Y = batch[1]  # target sentiment label 0 or 1

    with T.no_grad():
      oupt = model(X)  # single [0.0, 1.0]
    if oupt < 0.5 and Y == 0:
      num_correct += 1
    elif oupt >= 0.5 and Y == 1:
      num_correct += 1
    else:
      num_wrong += 1
    
  acc = (num_correct * 100.0) / (num_correct + num_wrong)
  return acc

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin PyTorch IMDB LSTM demo ")
  print("Using only reviews with 50 or fewer words ")
  T.manual_seed(3)  
  np.random.seed(3)

  # 1. load data 
  print("\nLoading preprocessed train and test data ")
  train_file = ".\\Data\\imdb_train_50w.txt"
  train_ds = IMDB_Dataset(train_file) 

  test_file = ".\\Data\\imdb_test_50w.txt"
  test_ds = IMDB_Dataset(test_file) 

  bat_size = 16
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True, drop_last=False)
  n_train = len(train_ds)
  n_test = len(test_ds)
  print("Num train = %d Num test = %d " % (n_train, n_test))

# -----------------------------------------------------------

  # 2. create network
  print("\nCreating LSTM binary classifier ")
  net = LSTM_Net().to(device)

  # 3. train model
  loss_func = T.nn.BCELoss()  # binary cross entropy
  lrn_rate = 0.001
  optimizer = T.optim.Adam(net.parameters(), lr=lrn_rate)
  max_epochs = 10  #30
  log_interval = 5  # display progress

  print("\nbatch size = " + str(bat_size))
  print("loss func = " + str(loss_func))
  print("optimizer = Adam ")
  print("learn rate = %0.4f " % lrn_rate)
  print("max_epochs = %d " % max_epochs)

  print(net.embed.weight)  # inspect embedding weights before training
  input()                  # pause

  print("\nStarting training ")
  net.train()  # set training mode
  for epoch in range(0, max_epochs):
    tot_err = 0.0  # for one epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # [bs,50]
      Y = batch[1]
      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y) 
      tot_err += loss_val.item()
      loss_val.backward()  # compute gradients
      optimizer.step()     # update weights
  
    if epoch % log_interval == 0:
      print("epoch = %4d  |" % epoch, end="")
      print("   loss = %10.4f  |" % tot_err, end="")
      net.eval()
      train_acc = accuracy(net, train_ds)
      print("  acc = %8.2f%%" % train_acc)
      net.train()

  print("Training complete")

  print(net.embed.weight)  # inspect embedding weights after training
  input()                  # pause

# -----------------------------------------------------------

  # 4. evaluate model
  net.eval()
  test_acc = accuracy(net, test_ds)
  print("\nAccuracy on test data = %8.2f%%" % test_acc)

  # 5. save model
  print("\nSaving trained model state")
  # fn = ".\\Models\\imdb_model.pt"
  # T.save(net.state_dict(), fn)

  # saved_model = Net()
  # saved_model.load_state_dict(T.load(fn))
  # use saved_model to make prediction(s)

  # 6. use model
  print("\nSentiment for \"the movie was a great \
waste of my time\"")
  print("0 = negative, 1 = positive ")
  review = np.array([4, 20, 16, 6, 86, 425, 7, 58, 64],
    dtype=np.int64)  # cheating . . 
  padding = np.zeros(50-len(review), dtype=np.int64)
  review = np.concatenate([padding, review])
  review = T.tensor(review, dtype=T.int64).to(device)
  
  net.eval()
  with T.no_grad():
    prediction = net(review)  # sigmoid output in [0.0, 1.0]
  print("raw output : ", end="")
  print("%0.4f " % prediction.item())
  
  print("\nEnd PyTorch IMDB LSTM sentiment demo")

if __name__ == "__main__":
  main()
Posted in PyTorch

PyTorch Transformers and the torch.set_num_threads() Function

Bottom line: Using the torch.set_num_threads() in a PyTorch program that has a Transformer module can significantly change the behavior of the program (in my case, for the better).

I was experimenting with a PyTorch program that uses a TransformerEncoder to do anomaly detection. See https://jamesmccaffrey.wordpress.com/2022/07/25/testing-a-transformer-based-autoencoder-anomaly-detection-system/.



This program mysteriously stopped working one day.


During training I saw this:

Starting training
epoch =    0   loss = 1658.0013
epoch =   10   loss = 945.3817
epoch =   20   loss = 467.7127
epoch =   30   loss = 277.3138
epoch =   40   loss = 202.2976
epoch =   50   loss = 160.5351
epoch =   60   loss = 130.3890
epoch =   70   loss = 108.6009
epoch =   80   loss = 92.6008
epoch =   90   loss = 81.1246
Done

The loss value steadily decreased which indicated that the network containing the TransformerEncoder was learning. Good.

Then, one morning, the exact same program on the exact same machine started showing:

Starting training
epoch =    0   loss = 1658.0013
epoch =   10   loss = 945.3808
epoch =   20   loss = 955.5437
epoch =   30   loss = 950.6722
epoch =   40   loss = 949.4239
epoch =   50   loss = 956.1644
epoch =   60   loss = 954.2082
epoch =   70   loss = 951.4501
epoch =   80   loss = 945.4361
epoch =   90   loss = 954.1126
Done

The loss value immediately got stuck and so the network was not learning. This had me baffled. Somewhat unfortunately, all my machines belong to my company and are joined to the company network domain, which means that they are constantly being updated. I assumed that one of the updates had changed something.

A couple of days later, I ran into my work pal Ricky L who is an expert with transformer architecture. I described the weirdness in my system to him. He said he wasn’t surprised and that one thing for me to try was to set the number of threads explicitly with the statement torch.set_num_threads(1).

I looked up set_num_threads() in the PyTorch documentation and found exactly three sentences:

TORCH.SET_NUM_THREADS
torch.set_num_threads(int)
Sets the number of threads used for intraop parallelism on CPU.

That wasn’t too helpful so I just added a global call at the top of my program:

# uci_trans_anomaly.py

# Transformer based reconstruction error for UCI Digits
# PyTorch 1.10.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11 

import numpy as np
import torch as T

device = T.device('cpu') 
T.set_num_threads(1)  # I added this statement

# -----------------------------------------------------------

class Transformer_Net(T.nn.Module):
  . . . etc

And voila! The program was working correctly again.

Note: You can check how many threads your machine is using by default with the torch.get_num_threads() function.

I still don’t know the exact cause of the change in behavior of my program. But the moral of the story is that calling set_num_threads() in programs that use a Transformer module might be a good idea.



A ventriloquist and his/her dummy have two output streams but only one underlying thread of execution. Three old albums from the 1960s. Left: Geraldine and Ricky. Center: Happy Harry and Uncle Weldon. Right: Chris Kirby and Terry. Do not click on image to enlarge unless you’re prepared for several years of nightmares.


Posted in PyTorch

Computing Input Gradients in PyTorch Using an Explicit Input Layer

In some rare deep neural problem scenarios, it’s useful to get the gradients of the input values. I ran into this idea while exploring the fast gradient sign method (FGSM) to generate evil data items that are deliberately designed to produce a misclassification. But input gradients can be useful in other scenarios too.



I needed input gradients for generating FGSM evil data.


The standard way to get input gradients looks something like this:

. . .
for bix, batch in enumerate(train_ldr):
  X = batch[0]  # inputs
  y = batch[1]  # targets
  X.requires_grad = True
  . . . 
  loss_val.backward()  # get gradients including X
  . . .

This approach works, but feels slightly like a hack. I got the idea into my head of creating an explicit input layer. I got it to work, but it was quite a bit more complicated than the standard technique.
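What "input gradients" means can be seen without autograd at all. For a tiny linear model y = w·x with squared-error loss L = (y - t)^2, the gradient with respect to the input is dL/dx = 2(y - t)w. A numpy sketch (made-up values) that checks the analytic gradient against a finite difference:

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # fixed model weights
x = np.array([1.0, 2.0, 3.0])    # input item
t = 3.0                          # target value

y = w @ x                        # model output: 4.5
grad_x = 2.0 * (y - t) * w       # analytic dL/dx = [1.5, -3.0, 6.0]

# verify one component with a forward finite difference
eps = 1e-6
xp = x.copy(); xp[0] += eps
num = ((w @ xp - t)**2 - (y - t)**2) / eps
assert np.isclose(grad_x[0], num, atol=1e-4)
```

FGSM uses exactly this kind of gradient, just computed through a deep network by autograd.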

Briefly, I created a custom Identity layer and used it to make a copy of the input. The key statements are:

class Identity(T.nn.Module):
  def __init__(self):
    super().__init__()  # shortcut syntax

  def forward(self, x):
    z = T.nn.Parameter(x)  # adds a gradient
    return z
. . .

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()  # old syntax
    self.identity = Identity()
    . . .

  def forward(self, x):
    self.inpt = self.identity(x)
    z = self.conv1(self.inpt)
    z = . . .

And then at any point the explicit input values can be accessed by net.inpt and their gradients by net.inpt.grad.data, assuming that net is the name of the Net() object.

Ultimately, the explicit input layer with gradients approach is interesting, but I think the standard approach is more practical for most of the scenarios I can think of.

The moral of the story is that PyTorch gives you tremendous flexibility for custom neural architectures.



Three early astonishing automata with life-like flexible movement. Left: "The Artist" (1800) by Swiss watchmaker Henri Maillardet. It could draw pre-programmed pictures using a quill pen. Center: "The Writer" (1770) by Pierre Jaquet-Droz. It could write a pre-programmed custom note. Right: "Psycho" (1875) by John Maskelyne. It could play a game of whist (a simple trick-taking card game like Hearts or Spades). Unlike the first two automata, this was a magic trick. Psycho sat on a clear glass tube to show there was no midget inside. But a confederate off stage used air pressure up through the tube to control Psycho. Clever!


Complete demo code. Uses a 100-item subset of the UCI Digits dataset from archive.ics.uci.edu/ml/machine-learning-databases/optdigits/.

# uci_digits_fgsm_integrated.py

# generate adversarial data using the fast gradient
# sign method (FGSM). this version uses an explicit
# input layer (with gradient) instead of using
# X.requires_grad=True after training.

# PyTorch 1.10.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11 

import numpy as np
import matplotlib.pyplot as plt
import torch as T

device = T.device('cpu') 

# -----------------------------------------------------------

class UCI_Digits_Dataset(T.utils.data.Dataset):
  # like 8,12,0,16, . . 15,7
  # 64 pixel values [0-16], label/digit [0-9]

  def __init__(self, src_file):
    tmp_xy = np.loadtxt(src_file, usecols=range(0,65),
      delimiter=",", comments="#", dtype=np.float32)
    tmp_x = tmp_xy[:,0:64]
    tmp_x /= 16.0  # normalize pixels to [0.0, 1.0]
    tmp_x = tmp_x.reshape(-1, 1, 8, 8)  # bs, chnls, 8x8
    tmp_y = tmp_xy[:,64]  # float32 form, must convert to int

    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.int64).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    pixels = self.x_data[idx]
    label = self.y_data[idx]
    return (pixels, label)  # as a tuple

# -----------------------------------------------------------

class Identity(T.nn.Module):
  def __init__(self):
    super().__init__()  # shortcut syntax

  def forward(self, x):
    z = T.nn.Parameter(x)
    return z

# -----------------------------------------------------------

class CNN_Net(T.nn.Module):
  def __init__(self):
    super(CNN_Net, self).__init__()  # pre Python 3.3 syntax

    self.identity = Identity() # explicit input layer
    
    self.conv1 = T.nn.Conv2d(1, 16, 2)  # chnl-in, out, krnl
    self.conv2 = T.nn.Conv2d(16, 24, 2)

    self.fc1 = T.nn.Linear(96, 64)   # [24*2*2, x]
    self.fc2 = T.nn.Linear(64, 10)   # 10 output vals

    self.pool1 = T.nn.MaxPool2d(2, 2)  # kernel, stride
    self.drop1 = T.nn.Dropout(0.10)    # between conv    
    self.drop2 = T.nn.Dropout(0.15)    # between fc

    # default weight and bias initialization
    # therefore order of definition matters
  
  def forward(self, x):
    # input x is Size([bs, 1, 8, 8])
    self.inpt = self.identity(x)       # Size([bs, 1, 8, 8])

    z = T.relu(self.conv1(self.inpt))  # Size([bs, 16, 7, 7])
    z = self.pool1(z)             # Size([bs, 16, 3, 3])
    z = self.drop1(z)             # Size([bs, 16, 3, 3])
    z = T.relu(self.conv2(z))     # Size([bs, 24, 2, 2])
   
    z = z.reshape(-1, 96)         # Size([bs, 96])
    z = T.relu(self.fc1(z))
    z = self.drop2(z)
    z = T.log_softmax(self.fc2(z), dim=1)  # for NLLLoss()
    return z

# -----------------------------------------------------------

def accuracy(model, ds):
  ldr = T.utils.data.DataLoader(ds,
    batch_size=len(ds), shuffle=False)
  n_correct = 0
  for data in ldr:
    (pixels, labels) = data
    with T.no_grad():
      oupts = model(pixels)
    (_, predicteds) = T.max(oupts, 1)
    n_correct += (predicteds == labels).sum().item()

  acc = (n_correct * 1.0) / len(ds)
  return acc

# -----------------------------------------------------------

def main():
  # 0. setup
  print("\nBegin UCI Digits FGSM with PyTorch demo ")
  np.random.seed(1)
  T.manual_seed(1)

  # 1. create Dataset objects
  print("\nLoading UCI digits train and test data ")
  # train_data = ".\\Data\\uci_digits_train_100.txt"
  train_data = ".\\Data\\optdigits_train_3823.txt"
  train_ds = UCI_Digits_Dataset(train_data)
  bat_size = 4
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  test_file = ".\\Data\\digits_uci_test_1797.txt"
  test_ds = UCI_Digits_Dataset(test_file)

# -----------------------------------------------------------

  # 2. create network
  print("\nCreating CNN classifier ")
  net = CNN_Net().to(device)
  net.train()  # set mode

# -----------------------------------------------------------

  # 3. train model
  loss_func = T.nn.NLLLoss()  # log_softmax output
  lrn_rate = 0.01
  opt = T.optim.SGD(net.parameters(), lr=lrn_rate)
  max_epochs = 50  # 50 
  log_every = 10   # 5

  print("\nStarting training ")
  for epoch in range(max_epochs):
    epoch_loss = 0.0
    for bix, batch in enumerate(train_ldr):
      X = batch[0]  # 64 normalized input pixels
      Y = batch[1]  # the class label

      opt.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # for progress display
      loss_val.backward()            # compute gradients
      opt.step()                     # update weights

    if epoch % log_every == 0:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))

  print("Done ")

# -----------------------------------------------------------

  # 4. evaluate model accuracy
  print("\nComputing model accuracy")
  net.eval()
  acc_train = accuracy(net, train_ds)  # all at once
  print("Accuracy on training data = %0.4f" % acc_train)
  
  net.eval()
  acc_test = accuracy(net, test_ds)  # all at once
  print("Accuracy on test data = %0.4f" % acc_test)

# -----------------------------------------------------------

  # 5. use model to make prediction: N/A
  
# -----------------------------------------------------------

  # 6. save model
  # print("\nSaving trained model state")
  # fn = ".\\Models\\uci_digits_model.pt"
  # T.save(net.state_dict(), fn)  

# -----------------------------------------------------------

# 7. create mutated inputs designed to trick model
  epsilon = 0.20
  print("\nCreating mutated images with epsilon = %0.2f "\
    % epsilon)
  mut_images_all = []   # all FGSM images   
  mut_images_bad = []   # images that are misclassified

  n_correct = 0; n_wrong = 0  # for mut_images_all

  test_ldr = T.utils.data.DataLoader(test_ds,
    batch_size=1, shuffle=False)
  loss_func = T.nn.NLLLoss()  # assumes log-softmax()

  for (batch_idx, batch) in enumerate(test_ldr):
    (X, y) = batch  # X = pixels, y = target label
    # X.requires_grad = True   # standard key idea
    net.zero_grad()            # zap all gradients
    oupt = net(X)
    loss_val = loss_func(oupt, y)
    loss_val.backward()        # compute gradients

    sgn = net.inpt.grad.data.sign()
    mutated = X + epsilon * sgn
    mutated = T.clamp(mutated, 0.0, 1.0)

    with T.no_grad():
      pred = net(mutated)  # 10 log-softmax logits

    pred_class = T.argmax(pred[0])
    mutated = mutated.detach().numpy()

    mut_images_all.append(mutated)    # regardless
    if pred_class.item() == y.item():
      n_correct += 1
    else:
      n_wrong += 1
      mut_images_bad.append(mutated)  # just misclassified
    
  # print(n_correct)
  # print(n_wrong)
  adver_acc = (n_correct * 1.0) / (n_correct + n_wrong)
  print("\nModel acc on mutated images = %0.4f " % adver_acc)
  num_bad = len(mut_images_bad)
  print("\nNumber misclassified mutated images = %d " \
    % num_bad)

# -----------------------------------------------------------
  
  # 8. show a test image and corresponding mutation

  idx = 33  # index of test item / evil item
  print("\nExamining test item idx = " + str(idx))
  pixels = test_ds[idx][0].reshape(8,8) 
  plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
  plt.show()  # reference image

  pixels = mut_images_all[idx].reshape(8,8)
  plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
  plt.show()  # corresponding mutated image
  
  x = test_ds[idx][0].reshape(1, 1, 8, 8)  # make a batch
  act_class = test_ds[idx][1].item()
  with T.no_grad():
    oupt = net(x)
  pred_class = T.argmax(oupt).item()

  print("\nActual class test item [idx] = " \
    + str(act_class))
  print("Pred class test item [idx] = " \
    + str(pred_class))

  x = mut_images_all[idx]
  x = T.tensor(x, dtype=T.float32).to(device)
  x = x.reshape(1, 1, 8, 8)
  with T.no_grad():
    oupt = net(x)
  pred_class = T.argmax(oupt).item()
  print("Predicted class mutated item [idx] = " \
    + str(pred_class))

# -----------------------------------------------------------

  print("\nEnd UCI Digits FGSM PyTorch demo ")

if __name__ == "__main__":
  main()
Posted in PyTorch