Banknote Authentication Example Using Keras

I recently upgraded my Keras library installation to version 2.6 and so I’ve been revisiting my three basic examples: Iris Dataset (multi-class classification), Boston Housing (regression), and Banknote Authentication (binary classification). In older versions of Keras, you would install the TensorFlow engine first and then the separate Keras library second, but now Keras is included in TensorFlow.

The Banknote Authentication dataset has 1372 items. Each item represents a banknote (think Euro or dollar bill) that is authentic (class 0) or a forgery (class 1). Each line of data has four predictor values: variance, skewness, kurtosis, and entropy of the banknote image.


A graph of some of the Banknote Authentication data — just kurtosis and entropy for the first 40 class 0 (authentic) and first 40 class 1 (forgery) items.

The Banknote data can be used as-is because all predictor values are roughly in the same range. However, I normalized each predictor value by dividing by 20 so that all values are between -1 and +1. Then I wrote a little script to randomly split the data into a 1000-item set for training (about 70% of the data) and a 372-item set for testing (about 30%).
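The split script isn't shown in the post, but the idea can be sketched like this (using synthetic stand-in data rather than the actual banknote file, so the sketch is self-contained):

```python
import numpy as np

np.random.seed(0)
# stand-in for the raw 1372-item file: 4 predictors plus a 0/1 class label
data = np.random.uniform(-17, 17, size=(1372, 5))
data[:, 4] = np.random.randint(0, 2, size=1372)   # class 0 = authentic, 1 = forgery

data[:, 0:4] /= 20.0                 # predictors now roughly in [-1, +1]
idx = np.random.permutation(1372)    # shuffle, then split 1000 / 372
train, test = data[idx[:1000]], data[idx[1000:]]
print(train.shape, test.shape)       # (1000, 5) (372, 5)
```

In the real script the two arrays would then be written out with np.savetxt() as tab-delimited train and test files.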

For the neural network binary classifier, I used a 4-(8-8)-1 architecture. I used tanh() activation on the two hidden layers, but I could have used relu() activation instead. I used sigmoid activation on the output node so that output values are between 0.0 and 1.0; an output value less than 0.5 indicates class 0 = authentic, and an output value of 0.5 or greater indicates class 1 = forgery.

import numpy as np
import tensorflow as tf
from tensorflow import keras as K

print("Creating 4-(8-8)-1 neural network ")
g_init = K.initializers.glorot_uniform(seed=1)
model = K.models.Sequential()
model.add(K.layers.Dense(units=8, input_dim=4,
  activation='tanh', kernel_initializer=g_init, 
  bias_initializer='zeros')) 
model.add(K.layers.Dense(units=8,
  activation='tanh', kernel_initializer=g_init,
  bias_initializer='zeros')) 
model.add(K.layers.Dense(units=1,
  activation='sigmoid', kernel_initializer=g_init,
  bias_initializer='zeros'))

I used explicit Glorot initialization for the layer weights and explicit zero-initialization for the layer biases. These are the default initialization schemes, so I could have omitted the explicit initialization. But I prefer explicit initialization — I think it's clearer, and it guards against confusion if the default initialization scheme ever changes.

I hit a few minor glitches as expected (deprecated parameter names, etc.) but I was able to fix these glitches quickly. This is a big advantage of experience with Keras or any other machine learning library — you make fewer mistakes, but more importantly, over time you learn how to correct mistakes quickly.

Good fun!



Dealing with mistakes is a part of any kind of software development, including the development of machine learning systems. According to Wikipedia, cellophane was invented as a result of a mistake.

Cellophane is a thin, transparent sheet made of regenerated wood or cotton cellulose. Cellophane was invented by Jacques Brandenberger, who in 1900 was inspired by seeing wine spill on a restaurant tablecloth; he decided to create a coated cloth that could repel that type of spill.

Cellophane is sometimes used for contemporary women’s fashion (left and right images) but has been around for a long time (center image is from 1933).


Code below.

# banknote_tfk.py
# Banknote classification
# Keras 2.6.0 in TensorFlow 2.6.0 ("_tfk")
# Anaconda3-2020.02  Python 3.7.6  Windows 10

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'  # suppress CPU warn

import numpy as np
import tensorflow as tf
from tensorflow import keras as K

class MyLogger(K.callbacks.Callback):
  def __init__(self, n):
    self.n = n   # print loss and acc every n epochs

  def on_epoch_end(self, epoch, logs={}):
    if epoch % self.n == 0:
      curr_loss = logs.get('loss')
      curr_acc = logs.get('accuracy') * 100
      print("epoch = %4d  loss = %0.6f  acc = %0.2f%%" % \
        (epoch, curr_loss, curr_acc))

def main():
  print("\nBanknote Authentication using Keras example ")
  np.random.seed(1)
  tf.random.set_seed(1)

  # 1. load data
  print("Loading Banknote data into memory ")
  train_file = ".\\Data\\banknote_train.txt"
  train_x = np.loadtxt(train_file, delimiter='\t',
    usecols=[0,1,2,3], dtype=np.float32)
  train_y = np.loadtxt(train_file, delimiter='\t',
    usecols=[4], dtype=np.float32)

  test_file = ".\\Data\\banknote_test.txt"
  test_x = np.loadtxt(test_file, delimiter='\t',
    usecols=[0,1,2,3], dtype=np.float32)
  test_y = np.loadtxt(test_file, delimiter='\t',
    usecols=[4], dtype=np.float32)

  # 2. define 4-(x-x)-1 deep NN model
  print("\nCreating 4-(8-8)-1 neural network ")
  g_init = K.initializers.glorot_uniform(seed=1)
  model = K.models.Sequential()
  model.add(K.layers.Dense(units=8, input_dim=4,
    activation='tanh', kernel_initializer=g_init, 
    bias_initializer='zeros')) 
  model.add(K.layers.Dense(units=8,
    activation='tanh', kernel_initializer=g_init,
    bias_initializer='zeros')) 
  model.add(K.layers.Dense(units=1,
    activation='sigmoid', kernel_initializer=g_init,
    bias_initializer='zeros'))  

  # 3. compile model
  opt = K.optimizers.SGD(learning_rate=0.01)  
  model.compile(loss='binary_crossentropy',
    optimizer=opt, metrics=['accuracy'])  

  # 4. train model
  max_epochs = 100
  log_every = 10
  my_logger = MyLogger(log_every)
  print("\nStarting training ")
  h = model.fit(train_x, train_y, batch_size=32,
    epochs=max_epochs, verbose=0, callbacks=[my_logger]) 
  print("Training finished ")

  # 5. evaluate model
  # np.set_printoptions(precision=4, suppress=True)
  eval_results = model.evaluate(test_x, test_y, verbose=0) 
  print("\nLoss, accuracy on test data: ")
  print("%0.4f %0.2f%%" % (eval_results[0], \
    eval_results[1]*100))

  # 6. save model
  print("\nSaving trained model as banknote_model.h5 ")
  # mp = ".\\Models\\banknote_model.h5"
  # model.save(mp)

  # 7. make a prediction
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  inpts = np.array([[0.5, 0.5, 0.5, 0.5]], dtype=np.float32)
  pred = model.predict(inpts)
  print("\nPredicting authenticity for: ")
  print(inpts) 
  print("Probability of class 1 (forgery) = %0.4f " % pred[0][0])

if __name__=="__main__":
  main()
Posted in Keras

Example of a PyTorch Custom Layer

When I create neural software systems, I most often use the PyTorch library. The Keras library is very good for basic neural systems but for advanced architectures I like the flexibility of PyTorch. Using raw TensorFlow without Keras is an option, but I am more comfortable using the PyTorch APIs.


An example of a custom NoisyLinear() layer. Notice the two outputs are slightly different.

I hadn’t looked at the problem of creating a custom PyTorch Layer in several months, so I figured I’d code up a demo. The most fundamental layer is Linear(). For a 4-7-3 neural network (four input nodes, one hidden layer with seven nodes, three output nodes), a definition could look like:

import torch as T

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 7)  # 4-7-3
    self.oupt = T.nn.Linear(7, 3)  # default init

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = self.oupt(z)
    return z

For my demo, I decided to create a custom NoisyLinear() layer that works just like a standard Linear() layer but injects randomness. This isn’t particularly useful by itself but I’m just experimenting. So I wanted a 4-7-3 network to work like this:

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = NoisyLinear(4, 7)  # 4-7-3
    self.oupt = NoisyLinear(7, 3)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = self.oupt(z) 
    return z

In other words, everything is the same except I use the program defined NoisyLinear() instead of the built-in torch.nn.Linear() layer. The custom layer definition I came up with is:

class NoisyLinear(T.nn.Module):
  def __init__(self, n_in, n_out):
    super().__init__()
    self.n_in, self.n_out = n_in, n_out

    self.weights = T.nn.Parameter(T.zeros((n_out, n_in),
      dtype=T.float32))
    self.bias = T.nn.Parameter(T.zeros(n_out,
      dtype=T.float32))
    self.lo = 0.90; self.hi = 0.98  # noise

    lim = 0.01  # initialize weights and bias
    T.nn.init.uniform_(self.weights, -lim, +lim)
    T.nn.init.uniform_(self.bias, -lim, +lim)

  def forward(self, x):
    wx = T.mm(x, self.weights.t())
    rnd = (self.hi - self.lo) * T.rand(1) + self.lo
    return rnd * T.add(wx, self.bias)  # wts * x + bias

The Parameter() class makes the weights and the bias trainable. I used basic uniform initialization with hard-coded range [-0.01, +0.01]. The forward() method computes weights * inputs + bias as usual, but then multiplies the result by random noise in the range [0.90, 0.98]. Each time the forward() method of a NoisyLinear() layer instance is called, the result will be slightly different.
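The injected-noise effect is easy to see in a tiny NumPy mock-up of the same computation (this illustrates the math only; it is not the PyTorch layer itself):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([[6.1, 3.1, 5.1, 1.1]])          # one 4-feature input item
W = rng.uniform(-0.01, 0.01, size=(7, 4))     # weights, init as in NoisyLinear
b = rng.uniform(-0.01, 0.01, size=7)          # bias

def noisy_forward(x):
    rnd = rng.uniform(0.90, 0.98)             # fresh noise on every call
    return rnd * (x @ W.T + b)                # noise * (wts * x + bias)

out1 = noisy_forward(x)
out2 = noisy_forward(x)
print(np.allclose(out1, out2))                # False: same input, different output
```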

Writing a custom layer for PyTorch is rarely needed, but compared to alternative libraries, customizing PyTorch is relatively easier — with an emphasis on “relatively”.



Three well-known custom cars. Left: Dodge Deora (1965). Center: Norman Timbs Special (1947). Right: Chrysler Thunderbolt (1941).


Complete demo code below. Long.

# iris_noisy_layer.py
# creating a custom "NoisyLinear" layer
# PyTorch 1.9.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10 

import numpy as np
import torch as T

device = T.device("cpu")  # to Tensor or Module

# -----------------------------------------------------------

class NoisyLinear(T.nn.Module):
  def __init__(self, n_in, n_out):
    super().__init__()
    self.n_in, self.n_out = n_in, n_out

    self.weights = T.nn.Parameter(T.zeros((n_out, n_in),
      dtype=T.float32))
    self.bias = T.nn.Parameter(T.zeros(n_out,
      dtype=T.float32))
    self.lo = 0.90; self.hi = 0.98  # noise

    lim = 0.01  # initialize weights and bias
    T.nn.init.uniform_(self.weights, -lim, +lim)
    T.nn.init.uniform_(self.bias, -lim, +lim)

  def forward(self, x):
    wx = T.mm(x, self.weights.t())
    rnd = (self.hi - self.lo) * T.rand(1) + self.lo
    return rnd * T.add(wx, self.bias)  # wts * x + bias

# -----------------------------------------------------------

class IrisDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None):
    # 5.0, 3.5, 1.3, 0.3, 0
    tmp_x = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,4), delimiter=",", skiprows=0,
      dtype=np.float32)
    tmp_y = np.loadtxt(src_file, max_rows=num_rows,
      usecols=4, delimiter=",", skiprows=0,
      dtype=np.int64)

    self.x_data = T.tensor(tmp_x, dtype=T.float32)
    self.y_data = T.tensor(tmp_y, dtype=T.int64)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    if T.is_tensor(idx):
      idx = idx.tolist()
    preds = self.x_data[idx]
    spcs = self.y_data[idx] 
    sample = { 'predictors' : preds, 'species' : spcs }
    return sample

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = NoisyLinear(4, 7)  # 4-7-3
    self.oupt = NoisyLinear(7, 3)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = self.oupt(z)  # no softmax: CrossEntropyLoss() 
    return z

# -----------------------------------------------------------

def accuracy(model, dataset):
  # assumes model.eval()
  dataldr = T.utils.data.DataLoader(dataset, batch_size=1,
    shuffle=False)
  n_correct = 0; n_wrong = 0
  for (_, batch) in enumerate(dataldr):
    X = batch['predictors'] 
    # Y = T.flatten(batch['species'])
    Y = batch['species']  # already flattened by Dataset
    with T.no_grad():
      oupt = model(X)  # logits form

    big_idx = T.argmax(oupt)
    # if big_idx.item() == Y.item():
    if big_idx == Y:
      n_correct += 1
    else:
      n_wrong += 1

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin Iris custom NoisyLinear layer demo \n")
  T.manual_seed(1)
  np.random.seed(1)
  
  # 1. create Dataset and DataLoader objects
  print("Creating Iris train DataLoader ")

  train_file = ".\\Data\\iris_train.txt"
  train_ds = IrisDataset(train_file, num_rows=120)

  bat_size = 4
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  # 2. create network
  net = Net().to(device)

  # 3. train model
  max_epochs = 20
  ep_log_interval = 4
  lrn_rate = 0.05

  loss_func = T.nn.CrossEntropyLoss()  # applies softmax()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_size)
  print("loss = " + str(loss_func))
  print("optimizer = SGD")
  print("max_epochs = %3d " % max_epochs)
  print("lrn_rate = %0.3f " % lrn_rate)

  print("\nStarting training")
  net.train()
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch
    num_lines_read = 0

    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch['predictors']  # [4,4]
      Y = batch['species']  # OK; already flattened

      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()            # gradients
      optimizer.step()               # update wts

    if epoch % ep_log_interval == 0:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))
  print("Done ")

  # 4. evaluate model accuracy
  print("\nComputing model accuracy")
  net.eval()
  acc = accuracy(net, train_ds)  # item-by-item
  print("Accuracy on train data = %0.4f" % acc)

  # 5. make a prediction
  print("\nPredicting species for [6.1, 3.1, 5.1, 1.1]: ")
  x = np.array([[6.1, 3.1, 5.1, 1.1]], dtype=np.float32)
  x = T.tensor(x, dtype=T.float32).to(device) 

  with T.no_grad():
    logits = net(x).to(device)  # values do not sum to 1.0
  probs = T.softmax(logits, dim=1).to(device)
  T.set_printoptions(precision=4)
  print(probs)

  print("\nPredicting again for [6.1, 3.1, 5.1, 1.1]: ")
  x = np.array([[6.1, 3.1, 5.1, 1.1]], dtype=np.float32)
  x = T.tensor(x, dtype=T.float32).to(device) 

  with T.no_grad():
    logits = net(x).to(device)  # values do not sum to 1.0
  probs = T.softmax(logits, dim=1).to(device)
  T.set_printoptions(precision=4)
  print(probs)

  print("\nEnd custom NoisyLinear layer demo")

if __name__ == "__main__":
  main()
Posted in PyTorch

Computing Model Accuracy for Keras Regression Models

I recently upgraded my Keras neural network code library version to 2.6.0 and decided to revisit my three basic examples — Iris (multi-class classification), Banknote (binary classification), and Boston (regression). This morning I refactored my Boston example.

Even though it had only been a few months since I last did the Boston example, I was surprised at how much Keras had changed, and how much my preferred techniques had changed.

The goal of the Boston Housing Dataset example is to predict the median house price in one of 506 towns near Boston. There are 13 predictor variables — average number of rooms in a house in the town, tax rate in the town, percentage of Black people in town, and so on.

I used order-of-magnitude normalization on the numeric predictors, then randomly split the 506-item dataset into a training set (400 items) and a test set (106 items). Preparing even simple data like this is tedious and very time-consuming. See https://jamesmccaffrey.wordpress.com/2021/08/18/preparing-the-boston-housing-dataset-for-pytorch/.

For my new version of the Boston example, one of the main changes I made was to write a much faster function to compute model accuracy. When you do classification with Keras, like the Iris example, you get a built-in accuracy function. But with regression, you must write your own function. For classification, a prediction of a discrete value like “red” is either correct or incorrect. But for regression, a prediction of a house price like 0.49500 = $49,500 will never be perfectly correct so you must determine if the prediction is within a specified percent of the target value.

When I use the PyTorch library, my standard approach for regression accuracy is:

loop
  get a line of input data
  get target value
  feed input to model, get predicted value
  if predicted is close enough to target
    num_correct += 1
  else
    num_wrong += 1
  end-if
end-loop
return num_correct / (num_correct + num_wrong)

This approach examines one data item at a time. The technique is simple, and it allows you to examine each item to see why a prediction was correct or wrong. But when using Keras, the item-by-item approach is painfully slow. I'm not sure why, but Keras is roughly 10 times slower than PyTorch when computing accuracy item-by-item.

So, a quick approach is useful. The idea is to compute all outputs at once.

read all input data
read all target values
feed all input, get all predicted values
compare all predicted to all targets
 ( result is like [1, 0, 0, 1, . . ] )
sum the comparison results
return sum / num_items

The ideas are simple and the code is short, but implementation is very, very tricky. Here is my implementation for the Boston data:

def accuracy_quick(model, data_x, data_y, pct):
  n = len(data_x)
  oupt = model(data_x)
  oupt = tf.reshape(oupt, [-1])  # 1D
 
  max_deltas = tf.abs(pct * data_y)    # max allowable deltas
  abs_deltas = tf.abs(oupt - data_y)   # actual differences
  results = abs_deltas < max_deltas    # [True, False, . .]

  n_correct = np.sum(results)
  acc = n_correct / n
  return acc

Even though a quick accuracy function implementation is tricky, once you know how to compute quick accuracy for one specific regression problem, it’s relatively easy to adapt the code to any other regression problem.
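The same all-at-once logic, sketched in plain NumPy with made-up values so the arithmetic is visible (pct = 0.10 means a prediction counts as correct if it's within 10 percent of the target):

```python
import numpy as np

def accuracy_quick_np(pred_y, data_y, pct):
    # correct if |predicted - target| < pct * |target|
    max_deltas = np.abs(pct * data_y)    # max allowable differences
    abs_deltas = np.abs(pred_y - data_y) # actual differences
    return np.mean(abs_deltas < max_deltas)  # mean of [True, False, . .]

targets = np.array([0.495, 0.210, 0.330, 0.780])
preds   = np.array([0.480, 0.290, 0.325, 0.700])
print(accuracy_quick_np(preds, targets, 0.10))   # 0.5 (2 of 4 close enough)
```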



The digital artist who goes by the name “batjorge” uses fractal generation software (Mandelbulb) to create interesting images of alien mushrooms and fungi. Are the images accurate? For art, the concept of accuracy doesn’t usually apply.


Code below. Very long.

Posted in Keras

Another Set of Beautiful Machine Learning Visualizations from Thorsten Kleppe

Thorsten Kleppe is a fellow machine learning enthusiast who creates beautiful ML visualizations. Thorsten sent me some of his latest work.

Thorsten’s new visualizations are based on a logistic regression model applied to the MNIST dataset.

The MNIST dataset contains 70,000 images (60,000 for training and 10,000 for testing). Each image is a handwritten digit from ‘0’ to ‘9’. Each image is only 28×28 pixels (784 pixels total) and each pixel value is a grayscale number between 0 (usually represented as white) and 255 (black). Given a set of 784 pixel values, the goal is to create a model that predicts the correct class, ‘0’ to ‘9’.

Here are 100 examples of MNIST digits.

The usual way to create a classification model for MNIST data is to use a deep neural network with convolution layers (a CNN). But Thorsten applied simple logistic regression. Logistic regression is designed for binary classification but there are several ways to extend logistic regression to handle multi-class classification problems like the MNIST dataset.
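One common extension is multinomial logistic regression: replace the single sigmoid output with a softmax over 10 outputs, trained with cross-entropy loss. A minimal from-scratch sketch on random stand-in data (this is a generic illustration, not Thorsten's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_pix, n_cls = 100, 784, 10
X = rng.random((n, n_pix))             # stand-in for scaled MNIST pixels
y = rng.integers(0, n_cls, size=n)     # stand-in digit labels

W = np.zeros((n_pix, n_cls)); b = np.zeros(n_cls)
lr = 0.1
for _ in range(50):                    # plain batch gradient descent
    z = X @ W + b
    z -= z.max(axis=1, keepdims=True)  # numeric stability
    p = np.exp(z); p /= p.sum(axis=1, keepdims=True)  # softmax probs
    grad = p.copy(); grad[np.arange(n), y] -= 1.0     # dLoss/dz for CE
    W -= lr * (X.T @ grad) / n
    b -= lr * grad.mean(axis=0)

preds = np.argmax(X @ W + b, axis=1)   # predicted digit per item
```

Each column of W holds the 784 weights for one digit class, which is exactly the kind of thing that can be reshaped to 28x28 and visualized.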

I think Thorsten's logistic regression model is 784-10 and if so, there are 784 input nodes (one for each pixel) and 10 output nodes (presumably the pseudo-probabilities of the 10 digits). Each input node then has 10 associated weights, one per output node, so each output node's 784 weights can be displayed as a 28×28 image.

Thorsten's visualizations and captions (above) mostly speak for themselves. He trained his model for different numbers of training epochs, which produced varying classification accuracies. He visualized the 784 weights associated with each output node, where red indicates a low weight value (close to 0) and green indicates larger values.

Very interesting stuff. Visualizing ML model weights is part of “model interpretability” — explaining why a model made a particular prediction.



Posted in Machine Learning

An Example of a Bayesian Neural Network Using PyTorch

A regular neural network has a set of numeric constants called weights which determine the network output. If you feed the same input to a regular trained neural network, you will get the same output every time.

In a Bayesian neural network, each weight is a probability distribution instead of a fixed value. Each time you feed an input to a Bayesian network, the weights will be slightly different, and so you get slightly different output each time, even for the same input.


A Bayesian neural network for the Iris dataset. The demo predicts the class probabilities three times for input = [5.0, 2.0, 3.0, 2.0] and gets three slightly different results because the weights are distributions instead of fixed values.

At first thought, this doesn't seem useful at all. But there are two advantages to a Bayesian neural network. First, the variability of the weights greatly deters model overfitting. Second, if you look at multiple output values for a single input and you see very different results, the network is not sure of its prediction, and you can deal with such "I don't know" predictions.
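The "I don't know" idea can be sketched without any Bayesian library: feed the same input several times to a model whose outputs vary, then look at the spread of the results. Here the stochastic model is just a simulated stand-in, not torchbnn:

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_predict(x):
    # stand-in for a Bayesian net: class probs with simulated weight noise
    base = np.array([0.20, 0.50, 0.30])           # "true" class probs
    p = base + rng.normal(0, 0.05, size=3)        # noise on each call
    p = np.clip(p, 1e-6, None)
    return p / p.sum()                            # renormalize to sum 1

runs = np.array([stochastic_predict(None) for _ in range(30)])
print("mean probs:", runs.mean(axis=0))
print("std  probs:", runs.std(axis=0))   # large std = model is not sure
```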

The two main disadvantages of Bayesian neural networks are 1.) they are extremely complicated to implement, and 2.) they are more difficult to train.

The most common approach for creating a Bayesian neural network is to use a standard neural library, such as PyTorch or Keras, plus a Bayesian library such as Pyro. These Bayesian libraries are complex and have a steep learning curve. I recently stumbled across a lightweight Bayesian network library for PyTorch that allowed me to explore Bayesian neural networks. The library was created by a single guy, “Harry24k”, and is very, very impressive. The library is called torchbnn and was at: https://github.com/Harry24k/bayesian-neural-network-pytorch.

I installed the torchbnn library via pip without trouble. The torchbnn GitHub repository had a nice, simple example in the documentation that worked the first time — a minor miracle when working with complex Python libraries. No, I take that back — it's a major miracle.

I refactored the simple documentation example because that’s how I learn best. The example creates a classifier for the Iris dataset. The key code for the neural network definition is:

import numpy as np
import torch as T
import torchbnn as bnn
device = T.device("cpu")

class BayesianNet(T.nn.Module):
  def __init__(self):            # 4-100-3
    super(BayesianNet, self).__init__()
    self.hid1 = bnn.BayesLinear(prior_mu=0, prior_sigma=0.1,
      in_features=4, out_features=100)
    self.oupt = bnn.BayesLinear(prior_mu=0, prior_sigma=0.1,
      in_features=100, out_features=3)

  def forward(self, x):
    z = T.relu(self.hid1(x))
    z = self.oupt(z)  # no softmax: CrossEntropyLoss() 
    return z

The network is 4-100-3: four inputs (sepal length and width, and petal length and width), 100 hidden nodes, and three outputs (setosa, versicolor, virginica). Instead of using standard torch.nn.Linear() layers, you use torchbnn.BayesLinear() layers. This gives you weights and biases that are distributions instead of regular tensors. You must specify the initial distribution mean (mu) and standard deviation (sigma).

When training the Bayesian neural network, the key code is:

X = batch['predictors']  # inputs
Y = batch['species']     # targets
optimizer.zero_grad()
oupt = net(X)            # outputs

cel = ce_loss(oupt, Y)   # regular loss
kll = kl_loss(net)       # distribution loss
tot_loss = cel + (0.10 * kll)

tot_loss.backward()      # compute gradients
optimizer.step()         # update wt distributions

Bayesian neural networks have been around for a long time. But they aren’t used very often in practice. I strongly suspect the main reason why they’re not used often is that they’re just too difficult to work with. But if relatively simple libraries like the torchbnn one I found were more common, I think that Bayesian neural networks might gain greater popularity.



Loosely speaking, the term Bayesian means “based on probability”. The entire city of Las Vegas is based on probability. Left: Western Airlines (1926-1987). Center: Bonanza Airlines (1945-1968). Right: National Airlines (1934-1980). All three were major, successful airlines, but are gone now. A cautionary note to all major, successful companies.


Code below. Very long.

# iris_bayesian_01b.py

# uses Bayesian library from:
# https://github.com/Harry24k/bayesian-
# neural-network-pytorch/blob/master/demos/
# Bayesian%20Neural%20Network%20Classification.ipynb
# pip install torchbnn

import numpy as np
import torch as T
import torchbnn as bnn

device = T.device("cpu")

# -----------------------------------------------------------

class IrisDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None):
    # like 5.0, 3.5, 1.3, 0.3, 0
    tmp_x = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,4), delimiter=",", skiprows=0,
      dtype=np.float32)
    tmp_y = np.loadtxt(src_file, max_rows=num_rows,
      usecols=4, delimiter=",", skiprows=0,
      dtype=np.int64)

    self.x_data = T.tensor(tmp_x, dtype=T.float32)
    self.y_data = T.tensor(tmp_y, dtype=T.int64)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    if T.is_tensor(idx):
      idx = idx.tolist()
    preds = self.x_data[idx]
    spcs = self.y_data[idx] 
    sample = { 'predictors' : preds, 'species' : spcs }
    return sample

# -----------------------------------------------------------

class BayesianNet(T.nn.Module):
  def __init__(self):            # 4-100-3
    super(BayesianNet, self).__init__()
    self.hid1 = bnn.BayesLinear(prior_mu=0, prior_sigma=0.1,
      in_features=4, out_features=100)
    self.oupt = bnn.BayesLinear(prior_mu=0, prior_sigma=0.1,
      in_features=100, out_features=3)

  def forward(self, x):
    z = T.relu(self.hid1(x))
    z = self.oupt(z)  # no softmax: CrossEntropyLoss() 
    return z

# -----------------------------------------------------------

def accuracy(model, dataset):
  # assumes model.eval()
  dataldr = T.utils.data.DataLoader(dataset, batch_size=1,
    shuffle=False)
  n_correct = 0; n_wrong = 0
  for (_, batch) in enumerate(dataldr):
    X = batch['predictors'] 
    Y = batch['species']  # already flattened by Dataset
    with T.no_grad():
      oupt = model(X)  # logits form

    big_idx = T.argmax(oupt)
    # if big_idx.item() == Y.item():
    if big_idx == Y:
      n_correct += 1
    else:
      n_wrong += 1

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def accuracy_quick(model, dataset):
  n = len(dataset)
  X = dataset[0:n]['predictors']  # all X 
  Y = T.flatten(dataset[0:n]['species'])  # 1-D

  with T.no_grad():
    oupt = model(X)
  arg_maxs = T.argmax(oupt, dim=1)  # collapse cols
  num_correct = T.sum(Y==arg_maxs)
  acc = (num_correct * 1.0 / len(dataset))
  return acc.item()

# -----------------------------------------------------------

def main():
  print("\nBegin Bayesian neural network Iris demo ")
  # 0. prepare
  np.random.seed(1)
  T.manual_seed(1)
  np.set_printoptions(precision=4, suppress=True, sign=" ")
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})

  # 1. load training data
  print("\nCreating Iris train Dataset and DataLoader ")
  train_file = ".\\Data\\iris_train.txt"
  train_ds = IrisDataset(train_file, num_rows=120)

  bat_size = 4
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  # 2. create network
  net = BayesianNet().to(device)

  # 3. train model (could put this into a train() function)
  max_epochs = 100
  ep_log_interval = 10

  ce_loss = T.nn.CrossEntropyLoss()   # applies softmax()
  kl_loss = bnn.BKLLoss(reduction='mean', last_layer_only=False)
  optimizer = T.optim.Adam(net.parameters(), lr=0.01)

  print("\nbat_size = %3d " % bat_size)
  print("loss = highly customized ")
  print("optimizer = Adam 0.01")
  print("max_epochs = %3d " % max_epochs)

  print("\nStarting training")
  net.train()
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch
    num_lines_read = 0

    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch['predictors']  # [4,4]
      Y = batch['species']  # already flattened
      optimizer.zero_grad()
      oupt = net(X)

      cel = ce_loss(oupt, Y)
      kll = kl_loss(net)
      tot_loss = cel + (0.10 * kll)

      epoch_loss += tot_loss.item()  # accumulate
      tot_loss.backward()  # update wt distribs
      optimizer.step()

    if epoch % ep_log_interval == 0:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))
  print("Training done ")

  # 4. evaluate model accuracy
  print("\nComputing Bayesian network model accuracy")
  net.eval()
  acc = accuracy_quick(net, train_ds)  # all-at-once
  print("Accuracy on train data = %0.4f" % acc)

  # 5. make a prediction
  print("\nPredicting species for [5.0, 2.0, 3.0, 2.0]: ")
  x = np.array([[5.0, 2.0, 3.0, 2.0]], dtype=np.float32)
  x = T.tensor(x, dtype=T.float32).to(device) 

  for i in range(3):
    with T.no_grad():
      logits = net(x).to(device)  # values do not sum to 1.0
    probs = T.softmax(logits, dim=1).to(device)
    print(probs.numpy())

  print("\nEnd Bayesian network demo ")

if __name__ == "__main__":
  main()
Posted in PyTorch

A Quick Look at Azure Functions

I was working on a project that uses Azure Functions. An Azure Function is an example of (the wildly misnamed) “serverless technology”. An Azure Function lives in the Cloud, accepts HTTP requests, and gives an HTTP response.

I hadn't used Azure Functions for several months so I did a quick Hello World to see if anything had changed. First, I installed the command line tool for Azure Functions from https://www.npmjs.com/package/azure-functions-core-tools. The installer is an msi file, so after I downloaded it (func-cli-x64.msi), I double-clicked on it and the install wizard ran without trouble.

Next, I created a new project named HttpHelloWorld by opening a shell, navigating to a directory where I wanted my code to live, and issued the commands:

func init --worker-runtime dotnet

func new --name HttpHelloWorld --language c# --template httptrigger

Azure Functions run on Azure, which costs money, but fortunately you can test them locally. I issued the command:

func start

The command searched the current directory for something to run, found it, and started up a local Web service (which simulates what would happen on Azure) that listens for HTTP requests.

I opened a browser and issued the request

http://localhost:7071/api/HttpHelloWorld?name=James

The Azure Function received the request and returned the response:

Hello, James. This HTTP triggered function executed successfully.

Designing and implementing an application using Azure Functions and serverless technologies has a completely different feel than working in a traditional computing environment. Not better or worse — just different. Ordinary Azure Functions are quite limited. A special kind, called Azure Durable Functions, are essentially class objects — they can maintain state, and so they're more useful.



The math term “function” is named after the adjective functional which means “designed to be practical and useful, rather than attractive”. Some things are purely functional, and some things are purely artistic. Here are three examples of semi-functional shoes. From Japan. Of course.


The HelloWorld template code:


using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;

public static class HttpHelloWorld
{
  [FunctionName("HttpHelloWorld")]
  public static async Task<IActionResult> Run(
    [HttpTrigger(AuthorizationLevel.Function, "get", "post")] HttpRequest req,
    ILogger log)
  {
    log.LogInformation("C# HTTP trigger function processed a request.");

    string name = req.Query["name"];

    string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
    dynamic data = JsonConvert.DeserializeObject(requestBody);
    name = name ?? data?.name;

    string responseMessage = string.IsNullOrEmpty(name)
      ? "This HTTP triggered function executed successfully. Pass a name in the" +
        " query string or in the request body for a personalized response."
      : $"Hello, {name}. This HTTP triggered function executed successfully.";

    return new OkObjectResult(responseMessage);
  }
}
Posted in Miscellaneous | Leave a comment

Installing Keras 2.6 on Windows and Running the Iris Example

One of the biggest challenges in machine learning is staying up to date with new releases of code libraries. I noticed that Keras released a new version 2.6 a few days ago so I figured I’d do a complete end-to-end example of installation and the Iris Dataset classification problem. I use Windows but the process is almost the same for Linux and Mac.

In the old days, you would install TensorFlow first, then Keras second. Now TensorFlow contains Keras.

First, I installed the Anaconda3-2020.02 distribution which contains Python 3.7.6 and over 500 Python packages. But before I did this, I uninstalled a couple of phantom Pythons (3.5, 3.6) that appeared on my machine. Many programs “helpfully” install a Python for you without being asked, which creates versioning collisions and errors.


Left: I uninstalled an old Python to prevent Python version collisions. Right: I like the Anaconda distribution of Python.

Next, I looked for the correct TensorFlow/Keras 2.6.0 whl (“wheel”) installation file. The default .whl file at pypi.org gives you a combined CPU-GPU version. If you install the dual version, when you run a program, TensorFlow/Keras will attempt to determine if your machine has a GPU or not and then run the correct version. Because I was installing on a machine that didn’t have a GPU, I looked for a CPU-only version of TensorFlow/Keras. I eventually found a whl file for a CPU-only version of TensorFlow/Keras — but finding it wasn’t easy.



The standard location for the whl install files gives only the combined GPU-CPU version of TensorFlow/Keras. It’s very easy to get the wrong version of anything related to machine learning — versioning hell is a significant problem.


Left: While searching for a CPU-only whl file for Windows, I found a URL for a CPU-only version for Linux. Right: After a bit of experimenting with the URL, I eventually found a CPU-only version for Windows at https://storage.googleapis.com/tensorflow/windows/cpu/tensorflow_cpu-2.6.0-cp37-cp37m-win_amd64.whl. I saved the whl file to my local machine, and then installed TensorFlow/Keras, 2.6, CPU-only, for Python 3.7, for Windows.


I used pip to install the CPU-only 2.6.0 version of TensorFlow without any problems. Minor miracle.

Next, I coded up an Iris Dataset demo. I found the 150-item Iris Dataset online — it’s in dozens of places. I manually one-hot encoded the labels by replacing “setosa” with 1, 0, 0 and replacing “versicolor” with 0, 1, 0 and replacing “virginica” with 0, 0, 1.
An alternative strategy is to first manually ordinal encode setosa = 0, versicolor = 1, virginica = 2, and then use tf.keras.utils.to_categorical() to programmatically convert the ordinal values to one-hot values.
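
The ordinal-then-convert strategy can be sketched with plain NumPy — the identity-matrix indexing trick below produces the same one-hot matrix that to_categorical() would, without needing TensorFlow installed (the four sample labels are made up for illustration):

```python
import numpy as np

# ordinal labels: setosa = 0, versicolor = 1, virginica = 2
ordinals = np.array([0, 1, 2, 1], dtype=np.int64)

# row i of the 3x3 identity matrix is the one-hot vector for class i
one_hot = np.eye(3, dtype=np.int64)[ordinals]
print(one_hot)
# setosa -> [1 0 0], versicolor -> [0 1 0], virginica -> [0 0 1]
```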

I ran the demo without too much trouble. There were about six errors I had to deal with, but that’s to be expected.

Good fun. I’m satisfied that I’m meeting the challenge of keeping up to date with Keras versions, and if I need to work on a Keras project, I can get up to speed quickly.



One of the challenges for people who have an inexplicable desire to follow Hollywood actors is keeping up to date with their wife versions. Here is American actor Nicolas Cage with wife version 1.0, wife version 2.0, wife version 3.0, and wife version 4.0.


Code below.

# iris_tfk.py
# Iris classification
# Anaconda3-2020.02  (Python 3.7.6)
# TensorFlow 2.6.0 (includes KerasTF 2.6.0)
# Windows 10  August 2021

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
from tensorflow import keras as K

class MyLogger(K.callbacks.Callback):
  def __init__(self, n):
    super(MyLogger, self).__init__()
    self.n = n   # print loss and acc every n epochs

  def on_epoch_end(self, epoch, logs={}):
    if epoch % self.n == 0:
      curr_loss = logs.get('loss')
      curr_acc = logs.get('accuracy') * 100
      print("epoch = %4d  loss = %0.6f  acc = %0.2f%%" % \
        (epoch, curr_loss, curr_acc))

def main():
  print("\nIris dataset using Keras/TensorFlow ")
  np.random.seed(1)
  tf.random.set_seed(1)

  print("\nLoading Iris data into memory ")
  data_file = ".\\Data\\iris_data.txt"
  train_x = np.loadtxt(data_file, usecols=[0,1,2,3],
    delimiter=",",  skiprows=0, dtype=np.float32)
  train_y = np.loadtxt(data_file, usecols=[4,5,6],
    delimiter=",", skiprows=0, dtype=np.int64)
  #  train_y = np.loadtxt(data_file, usecols=4,
  #    delimiter=",", skiprows=0, dtype=np.int64)  # ordinal
  #  train_y = tf.keras.utils.to_categorical(train_y) # one-hot

  print("\nCreating 4-5-3 neural network ")
  model = K.models.Sequential()
  model.add(K.layers.Dense(units=5, input_dim=4,
    activation='tanh', kernel_initializer='glorot_uniform',
    bias_initializer='zeros'))
  model.add(K.layers.Dense(units=3, activation='softmax'))
  model.compile(loss='categorical_crossentropy',
    optimizer='sgd', metrics=['accuracy'])  # default LR

  my_logger = MyLogger(n=3)

  print("\nStarting training ")
  h = model.fit(train_x, train_y, batch_size=1,
    epochs=12, verbose=0, callbacks=[my_logger])  # 1 = chatty
  print("Training finished ")

  eval = model.evaluate(train_x, train_y, verbose=0)
  print("\nModel evaluation: loss = %0.6f  accuracy = %0.2f%% " \
    % (eval[0], eval[1]*100) )

  print("\nSaving trained model as file iris_model.h5 ")
  # model.save_weights(".\\Models\\iris_model.h5")
  model.save(".\\Models\\iris_model.h5")

  # --------------
  np.set_printoptions(precision=4, suppress=True)
  wts = model.get_weights()
  print("\nTrained model weights and biases: ")
  for ar in wts:
    print(ar)
    print("")
  # --------------

  np.set_printoptions(precision=4)
  unknown = np.array([[6.1, 3.1, 5.1, 1.1]],
    dtype=np.float32)
  predicted = model.predict(unknown)
  print("\nUsing model to predict species for features: ")
  print(unknown)
  print("\nPredicted species is: ")
  print(predicted)

if __name__ == "__main__":
  main()
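
The predict() call returns a vector of three pseudo-probabilities. To convert them into a species name you can take the argmax of the output. A minimal sketch (the output values here are made up for illustration, not actual model output):

```python
import numpy as np

species = ["setosa", "versicolor", "virginica"]
predicted = np.array([[0.05, 0.15, 0.80]])  # hypothetical model output
idx = np.argmax(predicted[0])               # index of largest value
print(species[idx])  # virginica
```
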
Posted in Keras | Leave a comment

Wasserstein Distance Using C# and Python in Visual Studio Magazine

I wrote an article titled “Wasserstein Distance Using C# and Python” in the August 2021 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/08/16/wasserstein-distance.aspx.

There are many different ways to measure the distance between two probability distributions. Some of the most commonly used distance functions are Kullback-Leibler divergence, symmetric Kullback-Leibler distance, Jensen-Shannon distance, and Hellinger distance. The article shows how to compute the Wasserstein distance and explains why it is often preferable to alternative distance functions.

The Wasserstein distance (also known as Earth Mover Distance) is best explained by an example. Suppose P = (0.2, 0.1, 0.0, 0.0, 0.3, 0.0, 0.4) and Q = (0.0, 0.5, 0.3, 0.0, 0.2, 0.0, 0.0). See the image below. If you think of distribution P as piles of dirt and distribution Q as holes, the Wasserstein distance is the minimum amount of work required to transfer all the dirt in P to the holes in Q.

The transfer can be accomplished in six steps.

1. all 0.2 in dirt[0] is moved to holes[1], using up dirt[0], with holes[1] needing 0.3 more.

2. all 0.1 in dirt[1] is moved to holes[1], using up dirt[1], with holes[1] needing 0.2 more.

3. just 0.2 in dirt[4] is moved to holes[1], filling holes[1], leaving 0.1 left in dirt[4].

4. all remaining 0.1 in dirt[4] is moved to holes[2], using up dirt[4], with holes[2] needing 0.2 more.

5. just 0.2 in dirt[6] is moved to holes[2], filling holes[2], leaving 0.2 left in dirt[6].

6. all remaining 0.2 in dirt[6] is moved to holes[4], using up dirt[6], filling holes[4].

In each transfer, the amount of work done is the flow (amount of dirt moved) times the distance. The Wasserstein distance is the total amount of work done. Put slightly differently, the Wasserstein distance between two distributions is the effort required to transform one distribution into the other.
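
The six steps above can be verified with a few lines of code. Each step contributes flow times distance; the sketch below just tabulates the (flow, dirt index, hole index) triples listed above and sums the work:

```python
# (flow, dirt index, hole index) for the six transfer steps
steps = [(0.2, 0, 1), (0.1, 1, 1), (0.2, 4, 1),
         (0.1, 4, 2), (0.2, 6, 2), (0.2, 6, 4)]

tot_work = sum(flow * abs(di - hi) for (flow, di, hi) in steps)
print("%0.1f" % tot_work)  # total work = Wasserstein distance = 2.2
```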

For the Python language version, I implemented a program-defined my_wasserstein() function using two helper functions and one primary function:

# -----------------------------------------------------

import numpy as np

def first_nonzero(vec):
  dim = len(vec)
  for i in range(dim):
    if vec[i] > 0.0:
      return i
  return -1  # no non-zero cells found

# -----------------------------------------------------

def move_dirt(dirt, di, holes, hi):
  if dirt[di] <= holes[hi]:     # use all dirt
    flow = dirt[di]
    dirt[di] = 0.0            # all dirt got moved
    holes[hi] -= flow         # less to fill now
  elif dirt[di] > holes[hi]:    # use just part of dirt
    flow = holes[hi]          # fill remainder of hole
    dirt[di] -= flow          # less dirt left
    holes[hi] = 0.0           # hole is filled
  dist = np.abs(di - hi)
  return flow * dist          # work

# -----------------------------------------------------

def my_wasserstein(p, q):
  dirt = np.copy(p) 
  holes = np.copy(q)
  tot_work = 0.0

  while True:  # TODO: add sanity counter check
    from_idx = first_nonzero(dirt)
    to_idx = first_nonzero(holes)
    if from_idx == -1 or to_idx == -1:
      break
    work = move_dirt(dirt, from_idx, holes, to_idx)
    tot_work += work
  return tot_work

# -----------------------------------------------------

The function can be called like so:

def main():
  print("Begin Wasserstein distance demo ")

  P =  np.array([0.6, 0.1, 0.1, 0.1, 0.1])
  Q1 = np.array([0.1, 0.1, 0.6, 0.1, 0.1])
  Q2 = np.array([0.1, 0.1, 0.1, 0.1, 0.6])

  wass_p_q1 = my_wasserstein(P, Q1)
  wass_p_q2 = my_wasserstein(P, Q2)

  print("Wasserstein distances: ")
  print("P to Q1 : %0.4f " % wass_p_q1)
  print("P to Q2 : %0.4f " % wass_p_q2)

  print("End demo ")

if __name__ == "__main__":
  main()

The Wasserstein distance is slightly more complicated than alternatives like Kullback-Leibler, Jensen-Shannon, and Hellinger, but Wasserstein has better math properties and better practical properties.

The Wasserstein distance is an example of a classical statistics technique. Sometimes classical techniques can seem like they’re not very relevant, but there are many systems where classical statistics is combined with contemporary neural techniques — the Wasserstein Generative Adversarial Network (WGAN) is one example.



Three contemporary artists whose work I like. They use a combination of classical and modern styles. Left: By artist Richard Burlet. Center: By artist Manuel Nunez. Right: By artist Tang Wei Min.

Posted in Machine Learning | Leave a comment

NFL 2021 Week 1 Super Early Predictions – Zoltar Likes Four Underdogs

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s very early, preliminary predictions for week #1 of the 2021 season. I’ll re-run Zoltar again, closer to the start of the season on Thursday, September 9, when there’ll be updated Las Vegas point spread information. It usually takes Zoltar about three weeks to hit his stride.

Zoltar:  buccaneers  by    5  dog =     cowboys    Vegas:  buccaneers  by  6.5
Zoltar:     falcons  by    1  dog =      eagles    Vegas:     falcons  by  3.5
Zoltar:       bills  by    3  dog =    steelers    Vegas:       bills  by    6
Zoltar:    panthers  by    5  dog =        jets    Vegas:    panthers  by    4
Zoltar:     vikings  by    1  dog =     bengals    Vegas:     vikings  by    3
Zoltar:       colts  by    1  dog =    seahawks    Vegas:       colts  by  2.5
Zoltar:       lions  by    1  dog = fortyniners    Vegas: fortyniners  by  7.5
Zoltar:      texans  by    5  dog =     jaguars    Vegas:     jaguars  by  2.5
Zoltar:      titans  by    5  dog =   cardinals    Vegas:      titans  by  2.5
Zoltar:    redskins  by    2  dog =    chargers    Vegas:    chargers  by    1
Zoltar:      chiefs  by    5  dog =      browns    Vegas:      chiefs  by    6
Zoltar:      saints  by    1  dog =     packers    Vegas:      saints  by  2.5
Zoltar:    dolphins  by    1  dog =    patriots    Vegas:    patriots  by    2
Zoltar:      giants  by    3  dog =     broncos    Vegas:     broncos  by    1
Zoltar:        rams  by    4  dog =       bears    Vegas:        rams  by    7
Zoltar:      ravens  by    1  dog =     raiders    Vegas:      ravens  by  4.5 

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. Based on preliminary data, Zoltar has four preliminary suggestions at this time.
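
The more-than-3.0-points rule can be applied mechanically to the table above. In this sketch the margins are transcribed from the table, signed relative to Zoltar's predicted winner (a negative Vegas margin means Vegas favors the opponent); the encoding is my own, not part of Zoltar:

```python
# (Zoltar pick, Zoltar margin, Vegas margin for that same team)
games = [
  ("buccaneers", 5,  6.5), ("falcons",  1,  3.5), ("bills",   3,  6.0),
  ("panthers",   5,  4.0), ("vikings",  1,  3.0), ("colts",   1,  2.5),
  ("lions",      1, -7.5), ("texans",   5, -2.5), ("titans",  5,  2.5),
  ("redskins",   2, -1.0), ("chiefs",   5,  6.0), ("saints",  1,  2.5),
  ("dolphins",   1, -2.0), ("giants",   3, -1.0), ("rams",    4,  7.0),
  ("ravens",     1,  4.5),
]

# suggest a bet when Zoltar and Vegas disagree by more than 3.0 points
suggestions = [team for (team, z, v) in games if abs(z - v) > 3.0]
print(suggestions)  # the four disagreement games
```

When the rule fires because the Vegas margin exceeds Zoltar's margin, as in the ravens game, the suggested bet goes on the Vegas underdog, in that case the Raiders.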

Because of the way Zoltar initializes his calculations, all four preliminary recommendations are on Vegas underdogs.

1. Zoltar likes Vegas underdog Lions against the 49ers.
2. Zoltar likes Vegas underdog Texans against the Jaguars.
3. Zoltar likes Vegas underdog Giants against the Broncos.
4. Zoltar likes Vegas underdog Raiders against the Ravens.

For example, a bet on the Lions will pay off if the Lions win by any score, or if the favored 49ers win but by less than 7.5 points.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
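
The 53% figure comes from simple breakeven arithmetic: risking $110 to win $100, expected profit is zero when p * 100 = (1 - p) * 110, so:

```python
# breakeven: p * 100 = (1 - p) * 110  =>  p = 110 / 210
risk, win = 110.0, 100.0
breakeven = risk / (risk + win)
print("%0.4f" % breakeven)  # 0.5238, so about 53% accuracy needed
```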



My system is named after the Zoltar fortune teller machine you can find in arcades. There are many variations of Zoltar.

Posted in Zoltar | Leave a comment

Comparing Wasserstein Distance with Kullback-Leibler Distance

There are many ways to calculate the distance between two probability distributions. Four of the most common are Kullback-Leibler (KL), Jensen-Shannon (JS), Hellinger (H), and Wasserstein (W). When I was in school, I learned that W was superior to KL, JS, and H. For quite some time I wanted to verify this to my satisfaction by coding up a demo program. However, I wanted to use from-scratch implementations of all four distances, rather than use library functions such as the wasserstein_distance() function from the scipy library.

I recently implemented all four distance functions from scratch and so I was ready to verify the superiority of Wasserstein distance over Kullback-Leibler, Jensen-Shannon, and Hellinger.

I used three distributions:

"Left"   : [0.6  0.1  0.1  0.1  0.1]
"Center" : [0.1  0.1  0.6  0.1  0.1]
"Right"  : [0.1  0.1  0.1  0.1  0.6]

Common sense tells you that the distances between Left-Center and Center-Right should be the same, and the distance between Left-Right should be about twice as much.

With Wasserstein distance, that’s exactly what happens. But with Kullback-Leibler, Jensen-Shannon, and Hellinger, all three distances are the same. Note that my Wasserstein implementation isn’t true Wasserstein. Mine is best described as, “1D discrete probability distribution information transfer distance” — a very specific version of one of dozens of variations of Wasserstein distance.
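
A minimal sketch of the comparison. It uses the 1D cumulative-sum form of Wasserstein (which agrees with the greedy dirt-moving version for these distributions) and a from-scratch symmetric Kullback-Leibler, so the numbers are illustrative rather than the exact code from my demo:

```python
import numpy as np

def wasserstein_1d(p, q):
  # 1D discrete Wasserstein: sum of |CDF differences|
  return np.sum(np.abs(np.cumsum(p) - np.cumsum(q)))

def sym_kl(p, q):
  # symmetric Kullback-Leibler (assumes no zero entries)
  return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

left   = np.array([0.6, 0.1, 0.1, 0.1, 0.1])
center = np.array([0.1, 0.1, 0.6, 0.1, 0.1])
right  = np.array([0.1, 0.1, 0.1, 0.1, 0.6])

print(wasserstein_1d(left, center))  # ~1.0
print(wasserstein_1d(left, right))   # ~2.0, twice as far
print(sym_kl(left, center))          # same value ...
print(sym_kl(left, right))           # ... as this one
```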

The conceptual superiority of Wasserstein leads to the question: When would you ever want to use KL, JS, or H distance?

I’m not completely sure. My hunch is that because Wasserstein distance is significantly more complicated to implement, and the scipy wasserstein_distance() function is poorly documented, people tend to use KL, JS, and H because of their simplicity. Also, the KL divergence has a closed-form special case that can be computed easily when one of the two distributions is Normal with mean = 0 and sd = 1.
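
That special case, the KL divergence from N(mu, sigma^2) to the standard Normal N(0, 1), has a well-known closed form, 0.5 * (sigma^2 + mu^2 - 1 - ln(sigma^2)). A quick sketch:

```python
import math

def kl_to_std_normal(mu, sigma):
  # KL( N(mu, sigma^2) || N(0, 1) ) in closed form
  return 0.5 * (sigma * sigma + mu * mu - 1.0
                - math.log(sigma * sigma))

print(kl_to_std_normal(0.0, 1.0))  # 0.0 (identical distributions)
print(kl_to_std_normal(1.0, 1.0))  # 0.5
```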



I sometimes take an image I like, then use an Internet image search to find similar images, then select one of them, then search for an image similar to that image. Here are three related images I found in this way from artists Hans Jochem Bakker, Randy Monteith, and Pairoj Karndee. The Wasserstein distance can be used to compute the distance between images. I speculate that the distance between the left and center images is less than left-right and center-right.


Code below. Long.

Posted in Machine Learning | Leave a comment