Testing a Transformer Based Autoencoder Anomaly Detection System

For the past several months, I’ve been poking away at an idea to perform unsupervised anomaly detection using a system based on a deep neural Transformer Architecture (TA). The idea is to start with a dataset, and construct a TA encoder that creates a latent representation of the dataset. Then a neural decoder is applied to reconstruct each source data item. After reconstruction, the source data items are compared with their reconstructions. Data items that have the largest reconstruction error are tagged as anomalies.

Put another way, a TA-based system is similar to standard autoencoder reconstruction-error anomaly detection, except that it uses a Transformer encoder instead of an encoder based on fully connected linear layers.
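To make the idea concrete, here is a minimal sketch of the reconstruction-error scoring step. It assumes a trained encoder-decoder model named autoenc and a PyTorch Dataset named ds (both names are hypothetical placeholders, not my actual TA system):

import torch as T

def most_anomalous(autoenc, ds):
  # return the index and error of the item with the largest squared
  # reconstruction error; that item is flagged as the anomaly
  autoenc.eval()
  worst_err = 0.0; worst_idx = 0
  with T.no_grad():
    for i in range(len(ds)):
      x = ds[i][0].reshape(1, -1)           # one item as a mini-batch
      x_recon = autoenc(x)                  # reconstructed item
      err = T.sum((x - x_recon)**2).item()  # squared reconstruction error
      if err > worst_err:
        worst_err = err; worst_idx = i
  return (worst_idx, worst_err)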



This screenshot shows a demo run of generating FGSM data items. Can a TA anomaly detection system find the evil data?



This screenshot shows the Transformer based anomaly detection system in action. I fetched an evil FGSM data item and placed it in position [0] of a dataset with 100 benign items. The TA detection system found the evil item.


After many hours of experimentation, I got a TA based anomaly detection system working. My colleagues were not completely impressed. They wanted evidence that the anomaly detection system actually detects anomalies. Fair enough.

So I implemented a fast gradient sign method (FGSM) attack system to generate anomalous data. FGSM data is created so that it looks very similar to benign data but is misclassified by a neural classification system. I ran the FGSM program and fetched the first data item produced. I salted a normal 100-item dataset with the evil data item and then ran the TA anomaly detection system. The TA system correctly found the evil FGSM data item. I was happy.
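The heart of FGSM is a single perturbation step. Here is a minimal sketch, distilled from the full demo in the FGSM post later on; it assumes a trained classifier net, a loss function loss_func, and one (X, y) input/label pair, all of which are placeholders here:

import torch as T

def make_evil(net, loss_func, X, y, epsilon=0.20):
  # X: one input item shaped as a batch, y: its true class label
  X = X.clone().detach()
  X.requires_grad = True
  loss_val = loss_func(net(X), y)
  net.zero_grad()
  loss_val.backward()                       # gradient of the loss wrt the input pixels
  evil = X + epsilon * X.grad.data.sign()   # nudge each pixel in the worst direction
  return T.clamp(evil, 0.0, 1.0)            # keep pixels in [0.0, 1.0]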

I used the UCI Digits dataset for my experiments. Each data item is an 8 by 8 image of a handwritten digit. Each of the 64 pixels is a grayscale value between 0 and 16.

The Transformer based anomaly detection system experiment was a lot of work but very interesting. Apart from being potentially useful, I learned a lot.



Anomalous facial features can be unattractive or attractive. Three actresses with facial anomalies. Left: Sophia Loren (b. 1934) has a cleft chin, a fairly common anomaly (~5% of women). Center: Emma Watson (b. 1990) has freckles, a recessive trait anomaly (~5% of people have this trait). Right: Jane Seymour (b. 1951) has heterochromia, differently colored eyes. It is a rare (less than 1% of the population) anomaly.


Posted in PyTorch | Leave a comment

My Top Ten Favorite Science Fiction Movies with Teratoma Creatures

A teratoma is a type of tumor that can contain fully developed tissues and organs, including hair, teeth, muscle, and bone. A dermoid cyst is similar — it can contain hair, teeth, and even fingers or toes.

Here are ten science fiction movies that have a teratoma / dermoid creature. Listed in no particular order. Note: This blog post was inspired by Jourdan McCaffrey. No, Jourdan does not have a teratoma growing out of her back, at least as far as I know.


1. Malignant (2021) – A young woman named Madison has visions of a killer. Is she insane or is it her parasitic twin Gabriel? I don’t usually like horror themed sci fi but this movie is pretty good. I give the movie a B+ grade.


2. How to Get Ahead in Advertising (1989) – This is a black comedy. An advertising executive named Denis grows a cynical new persona. This movie has somewhat of a cult following but I give it just a C+ grade.


3. Harry Potter and the Sorcerer’s Stone (2001) – This is the first movie in the Harry Potter series. Known as “Harry Potter and the Philosopher’s Stone” in the UK. Voldemort lives on the back of the head of Professor Quirinus Quirrell. Grade = A+.


4. The Incredible Two-Headed Transplant (1971) – A scientist adds the head of a psychopath to a mentally handicapped man. I give it a C- grade.


5. The Manitou (1978) – A woman named Karen has a large and growing tumor on her neck. It turns out to be an evil Native American shaman named Misquamacus. Good idea but weak execution. I give it a C grade.


6. The Manster (1959) – Mad Japanese scientist Dr. Suzuki, aided by his seductive assistant Tara, injects American news reporter Larry Stanford with a drug that causes an eye to grow on Stanford’s shoulder. Objectively not a good film, but it has a certain charm and I give it a B- grade.


7. The Thing with Two Heads (1972) – Wealthy Dr. Kirshner is dying but transplants his head onto convict Black Jack Moss. Some people like this movie a lot but I’m not a fan of it and give it a C- grade.


8. Total Recall (1990) – In 2084, construction worker Douglas Quaid goes to an evil mining operation on Mars and meets the resistance leader Kuato, a mutant growing out of the abdomen of his brother George. Great special effects but I don’t like ambiguous story lines so I give the movie a B+ grade.


9. Jack the Giant Slayer (2013) – Jack goes up the beanstalk and rescues princess Isabelle from General Fallon, the two-headed leader of the giants. The movie is less than the sum of its parts and I give it a C+ grade.


10. The X-Files TV episode “Humbug” (1995) – FBI agents Mulder and Scully investigate a series of murders in a community of former circus sideshow performers. It turns out the killer is Lanny’s parasitic twin, Leonard, who is able to detach himself from Lanny’s body. I’m not a fan of this TV series but I liked this episode and give it a B- grade.


Posted in Top Ten | Leave a comment

The New Windows 11 Terminal Application – I Don’t Get It

A few days ago, I woke up in the morning and turned on my primary work laptop. It is a Microsoft Surface Book (an awesome machine). The night before, there was a Windows Update (cue sound of impending doom). I entered “cmd” in the Run dialog. I have done this literally tens of thousands of times over the past more-years-than-I-care-to-remember.


The Terminal application is a GUI container that can hold several shells. I don’t get the point.

And instead of seeing the familiar cmd shell I saw a slightly different version. What. The. Heck.

The Windows 11 Update had replaced the good old CMD shell with a new Terminal application. After a couple of minutes of fiddling around with Terminal, it became clear that Terminal is just a GUI container for multiple shells including the CMD shell and PowerShell, and tweaked versions such as Developer PowerShell for VS 2022.

OK.

But I don’t get it. Why do I need my shells combined into a GUI container?

I immediately set out to determine how to launch the old, familiar CMD shell. After a bit of Googling about, I discovered that the old CMD window is actually provided by the conhost.exe (console host) application — cmd.exe runs inside conhost.exe. In short, I just need to enter conhost.exe in the Run dialog.

After some experimentation, I still don’t understand the purpose of the new Terminal program. I regularly use the CMD shell on Windows and the bash shell on Linux systems. And I even worked on the original PowerShell, back when its code name was Monad. My point is, I understand working with shells but I don’t see the advantage of placing multiple shells in a GUI container. I wear big boy developer pants and can manage multiple shells.

I hope there will be some big advantage to Terminal that I just don’t see yet, as opposed to Terminal being a Windows feature that nobody asked for. I assume there are behind-the-scenes improvements in performance and possibly security, but even so, that doesn’t explain the need for Terminal.

Note: After the same Windows update that gave me the Terminal app, my beloved laptop started acting wonky in the sense that some (but not all) of my PyTorch programs started running slightly differently. I spent hours trying to track the problem down and still haven’t figured out the source of the new behavior. I even rolled back the update but the strange behavior continued. Grrr.



Sometimes I understand software system design choices, and sometimes I don’t. Fashion design choices are more subjective. I kind of like these three designs even though they’re clearly not functional.


Posted in Miscellaneous | Leave a comment

Fast Gradient Sign Method (FGSM) Example Using PyTorch on the UCI Digits Data

The fast gradient sign method (FGSM) is a technique to generate evil data items that are designed to trick a trained neural network classifier. I implemented a demo using PyTorch and the UCI Digits dataset.

Each UCI Digits data item is a crude 8 by 8 grayscale image of a handwritten digit from ‘0’ to ‘9’. You can find the dataset at archive.ics.uci.edu/ml/machine-learning-databases/optdigits/. There are 3823 training images and 1797 test images.

In the screenshot below, the demo begins by training a CNN network on the UCI Digits data. Then the demo uses FGSM on the 1797 test items to create 1797 evil items. The evil items are designed to look very much like the test items, but be misclassified by the model.

The trained network model scores 96.22% accuracy on the 1797 test items but only 24.76% accuracy on the evil items that were generated from the test items.

The demo displays test item [33], which is a ‘5’ digit, and the corresponding evil item [33] in visual format. The two images appear similar but the model classifies the evil image as a ‘9’ digit.

My demo didn’t take too long to put together because I used a previous example as a template. The previous example used the MNIST digits dataset. See https://jamesmccaffrey.wordpress.com/2022/07/25/fast-gradient-sign-method-fgsm-example-for-mnist-using-pytorch/.

The FGSM technique is one of those ideas that seems very complex until you figure it out, and then it seems easy. But regardless, the demo program has many tricky details. So the complexity of the FGSM demo program depends on how you look at it.



I’m not a big fan of the steampunk subculture but here are three photos that feature clever glasses. Probably not very practical for looking at things, but interesting.


Demo code. Replace “lt”, “gt”, “lte”, “gte” with Boolean operator symbols.

# uci_digits_fgsm.py

# generate adversarial data using the fast gradient
# sign method (FGSM)

# PyTorch 1.10.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11 

import numpy as np
import matplotlib.pyplot as plt
import torch as T

device = T.device('cpu') 

# -----------------------------------------------------------

class UCI_Digits_Dataset(T.utils.data.Dataset):
  # like 8,12,0,16, . . 15,7
  # 64 pixel values [0-16], label/digit [0-9]

  def __init__(self, src_file):
    tmp_xy = np.loadtxt(src_file, usecols=range(0,65),
      delimiter=",", comments="#", dtype=np.float32)
    tmp_x = tmp_xy[:,0:64]
    tmp_x /= 16.0  # normalize pixels to [0.0, 1.0]
    tmp_x = tmp_x.reshape(-1, 1, 8, 8)  # bs, chnls, 8x8
    tmp_y = tmp_xy[:,64]  # float32 form, must convert to int

    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.int64).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    pixels = self.x_data[idx]
    label = self.y_data[idx]
    return (pixels, label)  # as a tuple

# -----------------------------------------------------------

class CNN_Net(T.nn.Module):
  def __init__(self):
    super(CNN_Net, self).__init__()  # pre Python 3.3 syntax
    self.conv1 = T.nn.Conv2d(1, 16, 2)  # chnl-in, out, krnl
    self.conv2 = T.nn.Conv2d(16, 24, 2)

    self.fc1 = T.nn.Linear(96, 64)   # [24*2*2, x]
    self.fc2 = T.nn.Linear(64, 10)   # 10 output vals

    self.pool1 = T.nn.MaxPool2d(2, 2)  # kernel, stride
    self.drop1 = T.nn.Dropout(0.10)    # between conv    
    self.drop2 = T.nn.Dropout(0.15)    # between fc

    # default weight and bias initialization
    # therefore order of definition matters
  
  def forward(self, x):
    # input x is Size([bs, 1, 8, 8])
    z = T.relu(self.conv1(x))     # Size([bs, 16, 7, 7])
    z = self.pool1(z)             # Size([bs, 16, 3, 3])
    z = self.drop1(z)             # Size([bs, 16, 3, 3])
    z = T.relu(self.conv2(z))     # Size([bs, 24, 2, 2])
   
    z = z.reshape(-1, 96)         # Size([bs, 96])
    z = T.relu(self.fc1(z))
    z = self.drop2(z)
    z = T.log_softmax(self.fc2(z), dim=1)  # for NLLLoss()
    return z

# -----------------------------------------------------------

def accuracy(model, ds):
  ldr = T.utils.data.DataLoader(ds,
    batch_size=len(ds), shuffle=False)
  n_correct = 0
  for data in ldr:
    (pixels, labels) = data
    with T.no_grad():
      oupts = model(pixels)
    (_, predicteds) = T.max(oupts, 1)
    n_correct += (predicteds == labels).sum().item()

  acc = (n_correct * 1.0) / len(ds)
  return acc

# -----------------------------------------------------------

def main():
  # 0. setup
  print("\nBegin UCI Digits FGSM with PyTorch demo ")
  np.random.seed(1)
  T.manual_seed(1)

  # 1. create Dataset objects
  print("\nLoading UCI digits train and test data ")
  # train_data = ".\\Data\\uci_digits_train_100.txt"
  train_data = ".\\Data\\optdigits_train_3823.txt"
  train_ds = UCI_Digits_Dataset(train_data)
  bat_size = 4
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  test_file = ".\\Data\\digits_uci_test_1797.txt"
  test_ds = UCI_Digits_Dataset(test_file)

# -----------------------------------------------------------

  # 2. create network
  print("\nCreating CNN classifier ")
  net = CNN_Net().to(device)
  net.train()  # set mode

# -----------------------------------------------------------

  # 3. train model
  loss_func = T.nn.NLLLoss()  # log_softmax output
  lrn_rate = 0.01
  opt = T.optim.SGD(net.parameters(), lr=lrn_rate)
  max_epochs = 50  # 50 
  log_every = 10   # 5

  print("\nStarting training ")
  for epoch in range(max_epochs):
    epoch_loss = 0.0
    for bix, batch in enumerate(train_ldr):
      X = batch[0]  # 64 normalized input pixels
      Y = batch[1]  # the class label

      opt.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # for progress display
      loss_val.backward()            # compute gradients
      opt.step()                     # update weights

    if epoch % log_every == 0:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))

  print("Done ")

# -----------------------------------------------------------

  # 4. evaluate model accuracy
  print("\nComputing model accuracy")
  net.eval()
  acc_train = accuracy(net, train_ds)  # all at once
  print("Accuracy on training data = %0.4f" % acc_train)
  
  net.eval()
  acc_test = accuracy(net, test_ds)  # all at once
  print("Accuracy on test data = %0.4f" % acc_test)

# -----------------------------------------------------------

  # 5. use model to make prediction: N/A
  
# -----------------------------------------------------------

  # 6. save model
  # print("\nSaving trained model state")
  # fn = ".\\Models\\uci_digits_model.pt"
  # T.save(net.state_dict(), fn)  

# -----------------------------------------------------------

# 7. create inputs designed to trick model
  epsilon = 0.20
  print("\nCreating evil images from test w epsilon = %0.2f "\
    % epsilon)
  evil_images_lst = []
  n_correct = 0; n_wrong = 0

  test_ldr = T.utils.data.DataLoader(test_ds,
    batch_size=1, shuffle=False)
  loss_func = T.nn.NLLLoss()  # assumes log-softmax()

  for (batch_idx, batch) in enumerate(test_ldr):
    (X, y) = batch  # X = pixels, y = target label
    X.requires_grad = True
    oupt = net(X)
    loss_val = loss_func(oupt, y)
    net.zero_grad()  # zap all gradients
    loss_val.backward()  # compute gradients

    sgn = X.grad.data.sign()
    mutated = X + epsilon * sgn
    mutated = T.clamp(mutated, 0.0, 1.0)

    with T.no_grad():
      pred = net(mutated)  # 10 log-softmax logits
    pred_class = T.argmax(pred[0])
    if pred_class.item() == y.item():
      n_correct += 1
    else:
      n_wrong += 1

    # if batch_idx == 1:
    #   print("Predicted class of evil[1] = " + \
    #     str(pred_class.item()))

    mutated = mutated.detach().numpy()
    evil_images_lst.append(mutated)
    
  # print(n_correct)
  # print(n_wrong)
  adver_acc = (n_correct * 1.0) / (n_correct + n_wrong)
  print("\nModel acc on evil images = %0.4f " % adver_acc)

# -----------------------------------------------------------
  
  # show a test image and corresponding mutation

  idx = 33  # index of test item / evil item
  print("\nExamining test item idx = " + str(idx))
  pixels = test_ds[idx][0].reshape(8,8) 
  plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
  plt.show() 

  pixels = evil_images_lst[idx].reshape(8,8)
  plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
  plt.show() 
  
  x = test_ds[idx][0].reshape(1, 1, 8, 8)  # make it a batch
  act_class = test_ds[idx][1].item()
  with T.no_grad():
    oupt = net(x)
  pred_class = T.argmax(oupt).item()

  print("\nActual class test item [idx] = " \
    + str(act_class))
  print("Pred class test item [idx] = " \
    + str(pred_class))

  x = evil_images_lst[idx]
  x = T.tensor(x, dtype=T.float32).to(device)
  x = x.reshape(1, 1, 8, 8)
  with T.no_grad():
    oupt = net(x)
  pred_class = T.argmax(oupt).item()
  print("Predicted class evil item [idx] = " \
    + str(pred_class))

  print("\nEnd UCI Digits FGSM PyTorch demo ")

if __name__ == "__main__":
  main()
Posted in PyTorch | Leave a comment

The Difference Between Encoding, Embedding, and Latent Representation — in My World

Bottom line: In the machine learning projects I work on, an encoding converts categorical data to numeric data (example: one-hot encoding where “red” = [0 1 0 0]), an embedding converts an integer word ID to a vector (ex: “the” = 4 = [-0.1234, 1.9876, . . . 3.4681]), and a latent representation is a vector that represents a condensed version of a data item (ex: an autoencoder represents [“male”, 28, $63,000.00] as [2.3456, -0.7654, . . 0.9753]).
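Here is a tiny PyTorch sketch that shows the three ideas side by side. The specific numbers and sizes are made up for illustration:

import torch as T

# encoding: categorical value to numeric vector (one-hot)
# suppose "red" is category index 1 of 4, so "red" = [0 1 0 0]
red = T.nn.functional.one_hot(T.tensor(1), num_classes=4)

# embedding: integer word ID to a learned dense vector
embed = T.nn.Embedding(num_embeddings=10000, embedding_dim=4)
the_vec = embed(T.tensor([4]))   # word "the" = ID 4 = 4 learned values

# latent representation: a condensed version of an entire data item
# (sketch: the encoder half of an autoencoder, 3 inputs squeezed to 2 values)
encoder = T.nn.Linear(3, 2)
item = T.tensor([[0.0, 0.28, 0.63]])   # e.g. male, age 28, $63,000 (normalized)
latent = encoder(item)                 # 2-value latent vector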

There are no completely standard terminology guides for machine learning. Each project, research paper, and blog post should explain what each term means. The terms “encoding”, “embedding”, and “latent representation” can be, and often are, used interchangeably.



In my world — meaning the projects I work on — my colleagues and I usually try to use the meanings I presented in the first paragraph of this blog post.

The most general term is “latent representation”. The five main unsupervised neural architectures that create a latent representation are 1.) ordinary autoencoder (AE), 2.) variational autoencoder (VAE), 3.) generative adversarial network (GAN), 4.) transformer architecture encoder, and 5.) contrastive loss network. But there are dozens of other architectures for latent representations and each of the five architectures I mentioned has dozens and dozens of variations. For example, the latent representation of an AE is a simple vector but the latent representation of a VAE is a pair of vectors that represent the mean and log-variance of the source dataset.
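For example, here is the VAE case in code form: a minimal sketch of how the (mean, log-variance) pair is turned into a sampled latent vector via the reparameterization trick. The values are made up:

import torch as T

mu = T.tensor([0.25, -1.10, 0.60])        # mean vector
log_var = T.tensor([-0.50, 0.10, -1.20])  # log-variance vector

eps = T.randn_like(mu)                    # standard normal noise
z = mu + T.exp(0.5 * log_var) * eps       # sampled latent representation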

Terminology can help communicate ideas. But language is ambiguous and so it’s important to clearly define what is meant in any particular context.

A few years ago, neural networks were just one topic in machine learning, which was just one topic in computer science. I think maybe the moral of this blog post is that the topic of deep neural architecture is now so complex that it has become a separate field of study on par with topics such as mathematics, biochemistry, and physics. Put another way, I suspect colleges and universities will eventually offer a dedicated Bachelor’s degree in Machine Learning or Artificial Intelligence.



Three photos from a stock image search for “college classroom”. Left: This photo is baffling in so many ways but I especially like the mysterious vintage light bulbs. Center: A truly masterful composition of fruit, math, and non-optimal running shoes. Right: I doubt that plant DNA has ever been explained more clearly.


Posted in Machine Learning | Leave a comment

UCI Digits Image Classification Using a PyTorch CNN

One of my standard neural network examples is image classification on the MNIST dataset. The full MNIST (Modified National Institute of Standards and Technology) dataset has 60,000 images for training and 10,000 images for testing.

The UCI Digits dataset is similar to MNIST but smaller and easier to experiment with. Each UCI Digits image is an 8 x 8 (64 pixels) grayscale handwritten digit from ‘0’ to ‘9’. Each pixel value is an integer from 0 (white) to 16 (black).


Example UCI Digits images

The UCI Digits dataset can be found at archive.ics.uci.edu/ml/machine-learning-databases/optdigits/. The 3823-item training file is named optdigits.tra and the 1797-item test file is named optdigits.tes. The files are text files so I renamed them and added a “.txt” extension. Each line has 65 comma-delimited values. The first 64 values are the pixels (0 to 16) and the last value on each line is the digit (0 to 9).

I created a CNN system using PyTorch. The code that defines my network is:

class CNN_Net(T.nn.Module):
  def __init__(self):
    super(CNN_Net, self).__init__()  # pre Python 3.3 syntax
    self.conv1 = T.nn.Conv2d(1, 16, 2)  # chnl-in, out, krnl
    self.conv2 = T.nn.Conv2d(16, 24, 2)

    self.fc1 = T.nn.Linear(96, 64)   # [24*2*2, x]
    self.fc2 = T.nn.Linear(64, 10)   # 10 output vals

    self.pool1 = T.nn.MaxPool2d(2, 2)   # kernel, stride
    self.drop1 = T.nn.Dropout(0.10)
    self.drop2 = T.nn.Dropout(0.15)

    # default weight and bias initialization
    # therefore order of definition matters
  
  def forward(self, x):
    # input x is Size([bs, 1, 8, 8])
    z = T.relu(self.conv1(x))     # Size([bs, 16, 7, 7])
    z = self.pool1(z)             # Size([bs, 16, 3, 3])
    z = self.drop1(z)             # Size([bs, 16, 3, 3])
    z = T.relu(self.conv2(z))     # Size([bs, 24, 2, 2])
   
    z = z.reshape(-1, 96)         # Size([bs, 96])
    z = T.relu(self.fc1(z))       # Size([bs, 64])
    z = self.drop2(z)             # Size([bs, 64])
    z = T.log_softmax(self.fc2(z), dim=1)  # for NLLLoss()
    return z                      # Size([bs, 10])

The code is paradoxically simple and incredibly complex. The code is simple if you have implemented CNNs before because the parts — convolution layers, linear layers, pooling layers, dropout layers — are standard building blocks. However, there is an essentially unlimited number of ways to compose the building blocks, and furthermore, each building block has many optional parameters.

My demo code worked reasonably well and scored 99.35% accuracy on the training data (3798 of 3823 correct) and 96.22% accuracy on the test data (1729 of 1797 correct).

My motivation for creating the CNN system is related to a project I’m working on. I created a transformer-based autoencoder anomaly detection system. I used the UCI Digits dataset. To test the anomaly detection system I want to create some adversarial input data items using the Fast Gradient Sign Method (FGSM) technique, and then see if the anomaly detection system can find them when mixed with ordinary benign data items. To create FGSM items, I need access to a UCI Digits model.

Good brain exercise.



Brain Boy was a comic book series. There were only six issues, published in 1962-1963. These are covers of the first three issues.

Brain Boy was Matt Price. When his mother was pregnant, a car accident with an electrical tower killed his father and gave Matt mental powers and levitation. When he became an adult he was recruited as a government agent but he was still called “Brain Boy”, his childhood nickname.


Demo code. Replace “lt”, “gt”, “lte”, “gte” with Boolean operator symbols.

# uci_digits_cnn.py

# UCI Digits classification using a CNN
# note: intent is to use this as a basis for FGSM evil data
#  then use FGSM data to test TA anomaly detection

# PyTorch 1.10.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11 

import numpy as np
import matplotlib.pyplot as plt
import torch as T

device = T.device('cpu') 

# -----------------------------------------------------------

class UCI_Digits_Dataset(T.utils.data.Dataset):
  # like 8,12,0,16, . . 15,7
  # 64 pixel values [0-16], label/digit [0-9]

  def __init__(self, src_file):
    tmp_xy = np.loadtxt(src_file, usecols=range(0,65),
      delimiter=",", comments="#", dtype=np.float32)
    tmp_x = tmp_xy[:,0:64]
    tmp_x /= 16.0  # normalize pixels to [0.0, 1.0]
    tmp_x = tmp_x.reshape(-1, 1, 8, 8)  # bs, chnls, 8x8
    tmp_y = tmp_xy[:,64]  # float32 form, must convert to int

    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.int64).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    pixels = self.x_data[idx]
    label = self.y_data[idx]
    return (pixels, label)  # as a tuple

# -----------------------------------------------------------

class CNN_Net(T.nn.Module):
  def __init__(self):
    super(CNN_Net, self).__init__()  # pre Python 3.3 syntax
    self.conv1 = T.nn.Conv2d(1, 16, 2)  # chnl-in, out, krnl
    self.conv2 = T.nn.Conv2d(16, 24, 2)

    self.fc1 = T.nn.Linear(96, 64)   # [24*2*2, x]
    self.fc2 = T.nn.Linear(64, 10)   # 10 output vals

    self.pool1 = T.nn.MaxPool2d(2, 2)   # kernel, stride
    self.drop1 = T.nn.Dropout(0.10)
    self.drop2 = T.nn.Dropout(0.15)

    # default weight and bias initialization
    # therefore order of definition matters
  
  def forward(self, x):
    # input x is Size([bs, 1, 8, 8])
    z = T.relu(self.conv1(x))     # Size([bs, 16, 7, 7])
    z = self.pool1(z)             # Size([bs, 16, 3, 3])
    z = self.drop1(z)             # Size([bs, 16, 3, 3])
    z = T.relu(self.conv2(z))     # Size([bs, 24, 2, 2])
   
    z = z.reshape(-1, 96)         # Size([bs, 96])
    z = T.relu(self.fc1(z))
    z = self.drop2(z)
    z = T.log_softmax(self.fc2(z), dim=1)  # for NLLLoss()
    return z

# -----------------------------------------------------------

def accuracy(model, ds):
  ldr = T.utils.data.DataLoader(ds,
    batch_size=len(ds), shuffle=False)
  n_correct = 0
  for data in ldr:
    (pixels, labels) = data
    with T.no_grad():
      oupts = model(pixels)
    (_, predicteds) = T.max(oupts, 1)
    n_correct += (predicteds == labels).sum().item()

  acc = (n_correct * 1.0) / len(ds)
  return acc

# -----------------------------------------------------------

def display_digit(ds, idx):
  # ds is a PyTorch Dataset
  data = ds[idx][0]  # [0] is the pixels, [1] is the label
  pixels = np.array(data)  # tensor to numpy
  pixels = pixels.reshape((8,8))
  for i in range(8):
    for j in range(8):
      pxl = pixels[i,j]  # or [i][j] syntax
      # print("%.2X" % pxl, end="")  # hexadecimal
      print("%3d" % pxl, end="")
    print("")

  plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
  plt.show() 
  plt.close() 

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin UCI Digits CNN classification demo ")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. create Dataset object
  print("\nLoading UCI digits data ")
  # train_data = ".\\Data\\uci_digits_train_100.txt"
  train_data = ".\\Data\\optdigits_train_3823.txt"
  train_ds = UCI_Digits_Dataset(train_data)
  bat_size = 4
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

# -----------------------------------------------------------

  # 2. create network
  print("\nCreating CNN classifier ")
  net = CNN_Net().to(device)
  net.train()  # set mode

# -----------------------------------------------------------

  # 3. train 
  loss_func = T.nn.NLLLoss()  # log_softmax output
  lrn_rate = 0.01
  opt = T.optim.SGD(net.parameters(), lr=lrn_rate)
  max_epochs = 50
  log_every = 10

  print("\nStarting training ")
  for epoch in range(max_epochs):
    epoch_loss = 0.0
    for bix, batch in enumerate(train_ldr):
      X = batch[0]  # 64 normalized input pixels
      Y = batch[1]  # the class label

      opt.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # for progress display
      loss_val.backward()            # compute gradients
      opt.step()                     # update weights

    if epoch % log_every == 0:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))

  print("Done ")

# -----------------------------------------------------------

  # 4. evaluate model accuracy
  print("\nComputing model accuracy")
  net.eval()
  acc_train = accuracy(net, train_ds)  # all at once
  print("Accuracy on training data = %0.4f" % acc_train)

  test_file = ".\\Data\\digits_uci_test_1797.txt"
  test_ds = UCI_Digits_Dataset(test_file)
  net.eval()
  acc_test = accuracy(net, test_ds)  # all at once
  print("Accuracy on test data = %0.4f" % acc_test)

# -----------------------------------------------------------

  # 5. save model
  # TODO

# -----------------------------------------------------------

  # 6. use model
  print("\nPredicting for 64 random pixel values ")
  x = np.random.random(64)   # in [0.0, 1.0]

  x = x.reshape(8,8)
  plt.tight_layout()
  plt.imshow(x, cmap=plt.get_cmap('gray_r'))
  plt.show()

  x = x.reshape(1, 1, 8, 8)  # make it a batch
  x = T.tensor(x, dtype=T.float32).to(device)  # to tensor
  with T.no_grad():
    oupt = net(x)  # 10 log-softmax tensor logits
  print("\nRaw output logits: ")
  print(oupt)

  oupt = oupt.numpy()  # convert to numpy array
  probs = np.exp(oupt)  # pseudo-probs
  np.set_printoptions(precision=4, suppress=True)
  print("\nOutput pseudo-probabilities: ")
  print(probs)

  pred_class = np.argmax(probs)
  print("\nPredicted class/digit: ")
  print(pred_class)
  
  print("\nEnd UCI Digits demo ")

# -----------------------------------------------------------

if __name__ == "__main__":
  main()
Posted in PyTorch | Leave a comment

Yes, TensorFlow is Dead

July 2022: the TensorFlow neural network code library is dead.

OK, that statement is somewhat of a provocative exaggeration but bear with me.

It’s impossible to get hard data about the usage of TensorFlow (and Keras) relative to the other major library, PyTorch. Even if it were possible, such usage data would be instantly out-of-date by the time it was collated and published.

But I work at a large tech company and I have a circle of about a dozen colleagues and collaborators who work with neural systems at companies including Google, Microsoft, Amazon, Facebook, and others. All of these colleagues tell me the same thing, which is essentially that no new projects are using TensorFlow and all their teams that used to use TF have switched to PyTorch (or in a few cases at Google, switched to JAX).

Of course this is a small self-selecting sample. But the information is strong enough for me to stake my reputation in the following sense: If I were in a startup company (that my livelihood depended on) or startup team within a large company (that my job depended on), I would strongly advocate for PyTorch and strongly argue that TensorFlow / Keras has no future.

There are hundreds of blog posts and comments on the Internet on the topic of PyTorch vs. TensorFlow / Keras. I regularly use both libraries (or all three, depending on your point of view of what Keras is). Keras can be dismissed quickly: it operates at too high a level to give the flexibility needed for all but the simplest scenarios. That said, the three key issues that tell me TF is dead, in my opinion, are:

1.) The not-backward-compatible TF version 2.0 was a disaster and made the terrible TF documentation even worse.
2.) PyTorch is much easier to use than TF, in part because PyTorch is essentially Python modules as opposed to TF which feels more like custom code awkwardly integrated with Python.
3.) Google, the creator of TF, is now using JAX instead of TF for most new production and research systems.

So, in my mind, TensorFlow is clearly a dead end. I suspect Google will drop new development of TF within 36 months, at most. Because there is so much existing TF code, TF will likely limp along for several years and perhaps become the COBOL of machine learning.

One of my job responsibilities at the tech company I work for is to give training to software engineers and data scientists. Starting now, I will discontinue the TF / Keras classes I offer, and focus strictly on PyTorch.

JAX is an unknown. I’ve experimented with JAX and have the feeling that it works at too low a level. There are several efforts to make JAX closer to the level of abstraction of PyTorch. The FLAX library is one example. When one JAX-wrapper library emerges from the pack, it could be a good alternative to PyTorch.



Three images from a stock photo search for “machine learning engineer”. Left: Most ML engineers, including me, hate it when they’re using PyTorch and binary digits fly out of the screen and hit them in the face. Center: All ML engineers should have a wrench ready to debug their neural network code. Right: Most of my colleagues don’t dress quite like this, but some do use ergonomic desks.


Posted in Keras, Machine Learning, PyTorch | 2 Comments

Lightweight Mathematical Combinations Using C# in Visual Studio Magazine

I wrote an article titled “Lightweight Mathematical Combinations Using C#” in the July 2022 edition of the Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2022/07/20/math-combinations-using-csharp.aspx.

A zero-based mathematical (n, k) combination is a subset of k integers from 0 to n-1. For example, if n = 5 and k = 3 there are 10 combinations:

0 1 2
0 1 3
0 1 4
0 2 3
0 2 4
0 3 4
1 2 3
1 2 4
1 3 4
2 3 4

It’s usual to list combinations in what’s called lexicographical order. If each combination is interpreted as an ordinary integer, the combinations are listed from smallest (12) to largest (234).

The number of (n,k) combinations is n! / (k! * (n-k)!) where n! is factorial(n) = n * (n-1) * (n-2) * . . 1. The function is often called Choose(). For example, Choose(5, 3) = 5! / (3! * (5-3)!) = 120 / (6 * 2) = 120 / 12 = 10.
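The article uses the C# BigInteger type. As a quick sanity check, here is a rough Python equivalent (Python integers are arbitrary precision, so no special type is needed):

def choose(n, k):
  # number of (n, k) combinations = n! / (k! * (n-k)!)
  # computed iteratively to avoid huge intermediate factorials
  if k < 0 or k > n: return 0
  result = 1
  for i in range(1, k+1):
    result = (result * (n - k + i)) // i   # exact integer arithmetic
  return result

print(choose(5, 3))      # 10
print(choose(500, 100))  # the 108-digit value quoted just below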

The number of combinations gets very, very large as n and k increase. For example, Choose(500, 100) =

204,169,423,593,847,671,561,387,240,724,193,094,
030,165,460,090,325,932,474,143,661,505,758,330,
944,415,515,994,945,349,100,113,181,687,417,345

which is significantly larger than the estimated number of atoms in the Universe.

To deal with combinations using the C# language it’s necessary to use the BigInteger data type which can handle arbitrarily large values.

Mathematical combinations are related to, but quite different from, mathematical permutations. A zero-based mathematical permutation of order n is a rearrangement of the integers from 0 to n-1. For example, one permutation of order n = 5 is (2, 0, 1, 4, 3).

In my article I explain how to implement combinations using a simple integer array, compute Choose(n, k) using the BigInteger type, display all (n,k) combinations using a Successor() function, and compute a specific combination element directly.
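The article's C# implementation isn't repeated here, but a minimal Python sketch of the Successor() idea (return the next combination in lexicographical order) looks like this:

def successor(comb, n):
  # comb is a zero-based (n, k) combination as a list, e.g. [0, 1, 4]
  # returns the next combination in lexicographical order, or None at the end
  k = len(comb)
  nxt = list(comb)
  i = k - 1
  while i >= 0 and nxt[i] == n - k + i:  # find rightmost value that can increase
    i -= 1
  if i < 0: return None                  # comb was the last one, e.g. [2, 3, 4]
  nxt[i] += 1
  for j in range(i+1, k):                # reset the values to its right
    nxt[j] = nxt[j-1] + 1
  return nxt

# list all 10 of the (5, 3) combinations
c = [0, 1, 2]
while c is not None:
  print(c)
  c = successor(c, 5)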

The examples in the article use zero-based combinations. This is convenient because in many practical uses of combinations, combination values map to array indices. In a pure mathematical context, one-based combinations are more common.



Trade stimulators were early gambling machines. They first appeared in the 1880s. They were placed in bars, cigar shops, and general stores. These early machines didn’t pay off automatically — the owner of the store would pay winners from his cash register.

The first five-card poker trade stimulators didn’t allow holding cards. When a hold feature was added, draw poker trade stimulators became very popular. Most machines had 10 cards on each of the five reels so the virtual deck had only 50 cards instead of 52. Typically the Ten of Spades and Jack of Hearts were omitted. I’m not sure how this changes the possible five-card combinations.

Left: A five-card draw poker machine from the Groetchen Company, circa 1935. Center: A machine with horizontal reels, designed by Charles Fey (1862-1944), who also introduced the first modern slot machine with automatic payout in 1895. Right: A machine from the Rock-Ola Company, circa 1935. Rock-Ola is best known for juke boxes but the company made many other kinds of coin operated games.


Posted in Miscellaneous | Leave a comment

Compiling and Running a C Language Program on MacOS

When I’m working on a Linux machine, my C language compiler of choice is gcc. However, on a MacOS machine, the clang compiler is installed by default. I hadn’t used clang for a long time so I thought I’d do an example to refresh my memory.

I launched a bash shell by typing “terminal” in the MacOS search tool. Then I issued the command clang to make sure clang was installed on my Mac machine. I got some kind of an error message saying I had to install some sort of prerequisite so I clicked on the Install option. I can’t remember what the dependency was — I should have paid more attention. I think it might have been the Xcode (IDE) Command Line Tools set of programs.


Compiling and executing a C language program on a MacOS machine. I grabbed the screenshot using Shift + Command + 4.

After the prerequisite(s) were installed, I launched the TextEdit editor and selected the create a new file option. I wrote:

#include <stdio.h>

int main()
{
  printf("Hello there! \n");
  printf("And goodbye . . . \n");
}

I saved the file as hello.c in my Documents directory. I cd’ed to the Documents directory and issued the command:

$ clang -o helloWorld hello.c

The program compiled and then I executed the program by typing:

$ ./helloWorld

The clang compiler is actually a front end for the “LLVM” (originally “low level virtual machine”) compiler infrastructure. Years ago, MacOS stopped using gcc because of some sort of licensing issues. Weirdly, if you issue the command “gcc”, MacOS will redirect to the clang program. It is possible to install the real gcc compiler tools on a MacOS machine, but that’s another story.



Bubble wrapper dresses. Left: Elegant and attractive. Center: Creative and stylish. Right: Needs a bit of work.


Posted in Miscellaneous | Leave a comment

DARPA Funds the Adversarial Robustness Toolbox (ART) Library for ML Security on the Pure AI Web Site

I contributed to an article titled “DARPA Funds the Adversarial Robustness Toolbox (ART) Library for ML Security” on the Pure AI web site. See https://pureai.com/articles/2022/07/05/darpa-art.aspx.

The Adversarial Robustness Toolbox (ART) library is an open source collection of functions for machine learning security. The ART library was originally funded by a grant from the U.S. Defense Advanced Research Projects Agency (DARPA). Version 1.0 of the ART library was released in November 2019. The library has been under continuous development.

I was quoted in the article: McCaffrey noted, “It’s possible to implement ML attack and defense modules from scratch, but doing so requires expert-level programming skill and so the process is expensive and time-consuming. The ART library significantly reduces the effort required to explore ML security techniques.”

The article describes the four main types of ML attacks. Poisoning attacks insert malicious training data, which will corrupt the associated model. Inference attacks pull information from the training data, such as whether or not a particular person’s information is in the dataset. Extraction attacks create a replica of a trained model. Evasion attacks feed malicious inputs to a trained model in order to produce an incorrect prediction.
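To give a feel for the library, here is a rough sketch of an evasion attack using ART on a trained PyTorch model, similar to the FGSM demo in an earlier post. The parameter names are from memory, so treat this as an approximation and check the ART documentation before relying on it:

from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import FastGradientMethod

# assumes net, loss_func, opt = a trained PyTorch CNN, its loss function,
# and its optimizer; x_test = a NumPy array of benign input items
classifier = PyTorchClassifier(model=net, loss=loss_func, optimizer=opt,
  input_shape=(1, 8, 8), nb_classes=10, clip_values=(0.0, 1.0))

attack = FastGradientMethod(estimator=classifier, eps=0.20)
x_evil = attack.generate(x=x_test)   # adversarial versions of the benign items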

Dr. McCaffrey commented, “The major challenge facing libraries like the ART library for machine learning security is balancing the tradeoff between the library’s learning curve and the benefit from using the library.” He continued, “In many cases, the effort required to learn how to use a library isn’t worth the benefit gained and so it makes more sense to implement the library functionality from scratch.”

McCaffrey added, “In my opinion, the ART library for machine learning security hits a sweet spot in the tradeoff between learning effort and information reward. I have used the ART library to get relatively junior level data scientists and engineers up to speed with machine learning attacks and defenses, and then we later use custom code for advanced security scenarios.”



Funny adversarial animal attacks. Left: goat. Center: camel. Right: duck.


Posted in Machine Learning | Leave a comment