A Quick Demo of the DBSCAN Clustering Algorithm

I was reading a research paper this morning and the paper used the DBSCAN (“density-based spatial clustering of applications with noise”) clustering algorithm. DBSCAN is somewhat similar to k-means clustering. Both work only with strictly numeric data.

In k-means you must specify the number of clusters. DBSCAN doesn’t require you to specify the number of clusters, but for DBSCAN you must specify an epsilon value (how close is “close”) and a minimum number of points that constitute a core cluster. These two DBSCAN parameters implicitly determine the number of clusters.

I hadn’t used DBSCAN in a long time so I coded up a quick demo to refresh my memory. Implementing DBSCAN from scratch isn’t too difficult (I’ve done so using the C# language), but the scikit-learn Python library has a built-in implementation that’s simple and easy to use. So my demo was based on the scikit documentation example.

I set up 20 items of dummy data. I used 2D data so that I could graph the results. As with most clustering algorithms, the source data must be normalized so that large magnitude items don’t swamp small magnitude items.
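
As a concrete example of that kind of preprocessing (not part of this demo — the dummy data below is already roughly normalized), z-score normalization can be done with the scikit StandardScaler class. The age-income data here is hypothetical:

from sklearn.preprocessing import StandardScaler
import numpy as np

raw = np.array([[25.0, 110000.0],
                [40.0,  95000.0],
                [33.0, 134000.0]])  # hypothetical unscaled data
X = StandardScaler().fit_transform(raw)  # each column: mean 0, std 1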

The clustering function assigns a cluster ID label to each data item: 0, 1, 2, etc. Items that don’t get assigned to a cluster get a label of -1 to indicate they are “noise”.

I used the documentation code to create a graph of the clustering results. The five red dots are class 0, six green dots are class 1, and four blue dots are class 2. Noise items are colored black.

Data items that belong to clusters can be “core points” or non-core points. A core point is an item that has at least min_samples items (including itself) within distance eps. The large dots are core points, the smaller dots are non-core points.

Good fun!



A cluster of three clever photographs by Serge Lutens (1942-) with a playing card theme. Lutens is maybe best known for his work in the 1980s for Shiseido, a Japanese cosmetics company.


Code:

# dbscan_cluster.py

import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

print("\nBegin clustering demo with DBSCAN ")

X  = np.array([
 [ 0.325, -0.595],
 [ 0.507,  1.619],
 [ 0.817,  1.895],
 [ 1.147,  0.764],
 [-1.285, -0.95 ],
 [-1.237, -0.532],
 [ 1.108,  1.248],
 [-0.847, -0.722],
 [ 0.124, -1.346],
 [ 0.910, -0.227],
 [ 0.310, -0.756],
 [-1.384, -0.715],
 [ 0.736,  1.15 ],
 [ 0.511, -0.517],
 [-1.081, -0.91 ],
 [ 0.416,  1.252],
 [-2.349, -0.42 ],
 [-0.559, -1.161],
 [ 0.806,  1.054],
 [ 1.023, -0.133]])

print("\nX data: ")
print(X)

print("\nPerforming clustering ")
clustering = DBSCAN(eps=0.5, min_samples=4).fit(X)
print("Done ")

print("\nComputed cluster labels: ")
print(clustering.labels_)

print("\nIndices of core points: ")
print(clustering.core_sample_indices_)

n_clusters = np.max(clustering.labels_) + 1
counts = np.zeros(n_clusters+1, dtype=np.int64)  # extra slot for noise
for i in range(len(X)):
  lbl = clustering.labels_[i]
  if lbl == -1:
    counts[n_clusters] += 1
  else:
    counts[lbl] += 1
print("\nCluster counts, noise count: ")
print(counts)

print("\nDisplaying clusering: " )

core_samples_mask = np.zeros_like(clustering.labels_, \
  dtype=bool)
core_samples_mask[clustering.core_sample_indices_] = True

unique_labels = set(clustering.labels_)
colors = [plt.cm.hsv(each) \
          for each in np.linspace(0, 1, \
            len(unique_labels))]
for k, col in zip(unique_labels, colors):
  if k == -1:
    col = [0, 0, 0, 1]  # noise = black

  class_member_mask = (clustering.labels_ == k)

  xy = X[class_member_mask & core_samples_mask]
  plt.plot(xy[:, 0], xy[:, 1], 'o', 
    markerfacecolor=tuple(col),
    markeredgecolor='k', markersize=14)

  xy = X[class_member_mask & ~core_samples_mask]
  plt.plot(xy[:, 0], xy[:, 1], 'o',
    markerfacecolor=tuple(col),
    markeredgecolor='k', markersize=6)
plt.show()

print("\nEnd demo ")

Differential Evolution Optimization in Visual Studio Magazine

I wrote an article titled “Differential Evolution Optimization” in the September 2021 edition of the Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/09/07/differential-evolution-optimization.aspx.

The most common type of optimization for neural network training is some form of stochastic gradient descent (SGD). SGD has many variations such as Adam (adaptive momentum estimation) and Adagrad (adaptive gradient). All SGD-based optimization algorithms use the Calculus derivative (gradient) of an error function. But there are alternative optimization techniques that don’t use gradients. Examples include bio-inspired optimization techniques such as genetic algorithms and particle swarm optimization and geometry-inspired techniques such as Nelder-Mead and spiral dynamics.

My article explains how to implement a bio-inspired optimization technique called differential evolution optimization (DEO).

An evolutionary algorithm is any algorithm that loosely mimics biological evolutionary mechanisms such as mating, chromosome crossover, mutation and natural selection. Standard evolutionary algorithms can be implemented using dozens of specific techniques. Differential evolution is a special type of evolutionary algorithm that has a relatively well-defined structure:

create a population of possible solutions
loop
  for-each possible solution
    pick three other random solutions
    combine the three to create a mutation
    combine curr solution with mutation = candidate
    if candidate is better than curr solution then
      replace current solution with candidate
    end-if
  end-for
end-loop
return best solution found
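
Here’s a minimal Python sketch of that structure (my own illustration, not code from the article — the sphere error function, the population size, and the F and cr constants are arbitrary choices for demonstration):

import numpy as np

def sphere(x):  # hypothetical error function to minimize
  return np.sum(x * x)

def diff_evo(err_fn, dim=5, pop_size=20, F=0.5,
  cr=0.7, max_gen=100, seed=1):
  rng = np.random.default_rng(seed)
  pop = rng.uniform(-5.0, 5.0, size=(pop_size, dim))
  errs = np.array([err_fn(p) for p in pop])
  for gen in range(max_gen):
    for i in range(pop_size):
      # pick three other distinct random solutions
      others = [j for j in range(pop_size) if j != i]
      a, b, c = pop[rng.choice(others, size=3, replace=False)]
      mutation = a + F * (b - c)  # the "differential" part
      # crossover: combine curr solution with mutation
      mask = rng.random(dim) < cr
      mask[rng.integers(dim)] = True  # keep at least one mutated value
      candidate = np.where(mask, mutation, pop[i])
      cand_err = err_fn(candidate)
      if cand_err < errs[i]:  # candidate better than curr solution
        pop[i] = candidate
        errs[i] = cand_err
  best = np.argmin(errs)
  return pop[best], errs[best]

best_soln, best_err = diff_evo(sphere)
print("best error found = %0.6f " % best_err)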

The “differential” term in “differential evolution” is somewhat misleading. Differential evolution does not use Calculus derivatives. The “differential” refers to a specific part of the algorithm where three possible solutions are combined to create a mutation, based on the difference between two of the possible solutions.

Differential evolution optimization was originally designed for use in electrical engineering problems. But DEO has received increased interest as a possible technique for training deep neural networks. The biggest disadvantage of DEO is performance. DEO typically takes much longer to train a deep neural network than standard stochastic gradient descent (SGD) optimization techniques. However, DEO is not subject to the SGD vanishing gradient problem. At some point in the future, it’s quite possible that advances in computing power (through quantum computing) will make differential evolution optimization a viable alternative to SGD training techniques.



There are quite a few interesting science fiction movies that involve alien DNA altering evolution. Here are three, all from 1995. Left: In “Species”, scientists use instructions sent by aliens to splice alien DNA with human DNA. The result was not so good. Center: In “Mosquito”, an alien spacecraft crashes in a forest. A regular mosquito ingests some alien DNA and . . the result was not so good for campers in the area. Right: In “Village of the Damned”, 10 women are mysteriously impregnated by alien DNA. The resulting 10 children don’t turn out to be very friendly.


Yet Another MNIST Example Using Keras

It’s a major challenge to keep up with the continuous changes to the Keras/TensorFlow neural code library (and the PyTorch library too). I recently upgraded my Keras installation to version 2.6 and so I’m going through all my standard examples to bring them up to date with the inevitable changes.

I was using a new desktop machine and so I had to install TensorFlow 2.6 (which contains Keras 2.6). I ran into unexpected problems when the “wrapt” sub-library refused to build correctly (the installation process builds a .whl file for wrapt instead of using a pre-built .whl file). I found a hack online that suggested issuing the command SET WRAPT_INSTALL_EXTENSIONS=false before the command pip install tensorflow, and that magically worked. Somehow.

One of the standard examples is the MNIST image dataset. There are 70,000 simple images (60,000 training images and 10,000 test images). Each image has 28×28 pixels and is a handwritten digit from ‘0’ to ‘9’. Each pixel value is between 0 and 255.

I used the built-in MNIST dataset from Keras but I could have loaded the raw MNIST data using np.loadtxt() or a similar function.
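
For example, a raw-data version might look something like this sketch (the file path and the tab-delimited format here are hypothetical assumptions, not an actual file from the demo):

import numpy as np
# hypothetical format: 784 pixel values then the class label per line
all_xy = np.loadtxt(".\\Data\\mnist_train.txt",
  delimiter="\t", dtype=np.float32)
train_x = all_xy[:, 0:784].reshape(-1, 28, 28, 1)
train_y = all_xy[:, 784].astype(np.int64)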

I used the Model() approach by defining separate layers and then passing the first and last layer to the Model() constructor. An alternative design is to use the Sequential() approach. I have no strong preference between Model() and Sequential() — it’s just syntax.
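
For comparison, here’s a minimal sketch of the same network expressed with the Sequential() approach (using the K alias for tensorflow.keras and the same layer hyperparameters as the Model() version in the full code below):

model = K.models.Sequential([
  K.layers.Conv2D(32, kernel_size=(3,3), activation='relu',
    input_shape=(28,28,1)),
  K.layers.Conv2D(64, kernel_size=(3,3), activation='relu'),
  K.layers.MaxPooling2D(pool_size=(2,2)),
  K.layers.Dropout(0.25),
  K.layers.Flatten(),
  K.layers.Dense(128, activation='relu'),
  K.layers.Dropout(0.5),
  K.layers.Dense(10, activation='softmax')
])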

To save time, I used only 2 training epochs with a batch size of 100. In a non-demo scenario I’d use more epochs, but then I’d have to watch for over-fitting.

After the model was trained, I set up a fake 28×28 image with one vertical bar, one horizontal bar, and one diagonal bar. The trained model predicted the fake image is a ‘5’ with pseudo-probability = 1.0000. Many machine learning systems aren’t very good at distinguishing between authentic and fake items. This has generated interest in ML systems that output a prediction plus a confidence score of some type.

Good fun!



Left: A clever pseudo Chanel bag made from a paper grocery bag and a chain from a hardware store. Center: A whimsical pseudo Louis Vuitton bag, complete with misspelling. Right: A serious attempt at a Coach bag, but I don’t think it will fool many people.


Code below.

# mnist_tfk.py
# MNIST using CNN 
# Keras 2.6.0 in TensorFlow 2.6.0 ("_tfk")
# Anaconda3-2020.02  Python 3.7.6  Windows 10

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'  # suppress warn

import numpy as np
import tensorflow as tf
from tensorflow import keras as K
import matplotlib.pyplot as plt

def main():
  # 0. get started
  print("\nBegin MNIST using Keras %s " % K.__version__)
  np.random.seed(1)
  tf.random.set_seed(1)

  # 1. load data
  print("\nLoading train and test data ")
  (train_x, train_y), \
  (test_x, test_y) = K.datasets.mnist.load_data()
  train_x = train_x.reshape(60_000, 28, 28, 1)
  test_x = test_x.reshape(10_000, 28, 28, 1)
  train_x = train_x.astype(np.float32)
  test_x = test_x.astype(np.float32)
  train_x /= 255
  test_x /= 255
  train_y = K.utils.to_categorical(train_y, 10)
  test_y = K.utils.to_categorical(test_y, 10)

  # 2. define model
  print("\nCreating network with two Convolution, \
two Dropout, two Dense layers ")
  g_init = K.initializers.glorot_uniform(seed=1)
  opt = K.optimizers.Adam(learning_rate=0.01)

  x = K.layers.Input(shape=(28,28,1))
  con1 = K.layers.Conv2D(filters=32,
    kernel_size=(3,3), kernel_initializer=g_init,
    activation='relu', padding='valid')(x)
  con2 = K.layers.Conv2D(filters=64,
    kernel_size=(3,3), kernel_initializer=g_init,
    activation='relu', padding='valid')(con1)
  mp1 = K.layers.MaxPooling2D(pool_size=(2,2))(con2)
  do1 = K.layers.Dropout(0.25)(mp1)
  z = K.layers.Flatten()(do1)
  fc1 = K.layers.Dense(units=128,
    kernel_initializer=g_init, activation='relu')(z)
  do2 = K.layers.Dropout(0.5)(fc1)
  fc2 = K.layers.Dense(units=10,
    kernel_initializer=g_init, activation='softmax')(do2)

  model = K.models.Model(x, fc2)

  model.compile(loss='categorical_crossentropy',
    optimizer=opt, metrics=['accuracy'])
  
  # 3. train model
  bat_size= 100
  max_epochs = 2
  print("\nStarting training with batch size = %d " % bat_size)
  model.fit(train_x, train_y, batch_size=bat_size,
    epochs=max_epochs, verbose=1)
  print("Training finished ")

  # 4. evaluate model
  eval = model.evaluate(test_x, test_y, verbose=0)
  loss = eval[0]
  acc = eval[1] * 100
  print("\nTest data: loss = %0.4f \
 accuracy = %0.2f%%" % (loss, acc))

  # 5. save model
  print("\nSaving MNIST model to disk ")
  # mp = ".\\Models\\mnist_model.h5"
  # model.save(mp)

  # 6. use model
  print("\nMaking prediction for fake image: ")
  # np.set_printoptions(precision=4, suppress=True)
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})

  x = np.zeros(shape=(28,28), dtype=np.float32)
  for row in range(5,23):
    x[row][9] = 180  # vertical line
  for rc in range(9,19):
    x[rc][rc] = 250  # diagonal
  for col in range(5,15):  
    x[14][col] = 200  # horizontal
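  # note: unlike the training data, these pixel values were not
  # divided by 255, so the fake image is even farther from the
  # training data distribution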

  plt.imshow(x, cmap=plt.get_cmap('gray_r'))
  plt.show()

  x = x.reshape(1, 28, 28, 1)
  pred_probs = model.predict(x)
  print("\nPrediction probabilities: ")
  print(pred_probs)
  
  pred_digit = np.argmax(pred_probs)
  print("\nPredicted digit: ")
  print(pred_digit)

  print("\nEnd MNIST demo ")

if __name__ == "__main__":
  main()

NFL 2021 Week 2 Predictions – Zoltar Likes Eight Vegas Underdogs

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #2 of the 2021 season. These predictions are tentative, in the sense that it usually takes Zoltar about three weeks to hit his stride.

Zoltar:    redskins  by    4  dog =      giants    Vegas:    redskins  by    3
Zoltar:      saints  by    0  dog =    panthers    Vegas:      saints  by  3.5
Zoltar:       bears  by    4  dog =     bengals    Vegas:       bears  by    3
Zoltar:      browns  by    6  dog =      texans    Vegas:      browns  by 12.5
Zoltar:       colts  by    2  dog =        rams    Vegas:        rams  by    4
Zoltar:     broncos  by    0  dog =     jaguars    Vegas:     broncos  by    6
Zoltar:    dolphins  by    2  dog =       bills    Vegas:       bills  by  3.5
Zoltar:    patriots  by    0  dog =        jets    Vegas:    patriots  by    6
Zoltar:      eagles  by    2  dog = fortyniners    Vegas: fortyniners  by  3.5
Zoltar:    steelers  by    6  dog =     raiders    Vegas:    steelers  by    6
Zoltar:   cardinals  by    6  dog =     vikings    Vegas:   cardinals  by  4.5
Zoltar:  buccaneers  by    6  dog =     falcons    Vegas:  buccaneers  by 12.5
Zoltar:    chargers  by    6  dog =     cowboys    Vegas:    chargers  by    3
Zoltar:    seahawks  by    6  dog =      titans    Vegas:    seahawks  by  5.5
Zoltar:      chiefs  by    0  dog =      ravens    Vegas:      chiefs  by    4
Zoltar:     packers  by    6  dog =       lions    Vegas:     packers  by 10.5

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I use 3.0 points difference but for the first few weeks of the season I go a bit more conservative and use 4.0 points difference as the advice threshold criterion.

At the beginning of the season, because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is very strongly biased towards Vegas underdogs. I need to fix this.

1. Zoltar likes Vegas underdog Texans against the Browns.
2. Zoltar likes Vegas underdog Colts against the Rams.
3. Zoltar likes Vegas underdog Jaguars against the Broncos.
4. Zoltar likes Vegas underdog Dolphins against the Bills.
5. Zoltar likes Vegas underdog Jets against the Patriots.
6. Zoltar likes Vegas underdog Eagles against the 49ers.
7. Zoltar likes Vegas underdog Falcons against the Buccaneers.
8. Zoltar likes Vegas underdog Lions against the Packers.

For example, a bet on the underdog Texans against the Browns will pay off if the Texans win by any score, or if the favored Browns win but by less than 12.5 points (in other words, win by 12 points or fewer).

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

In week #1, against the Vegas point spread, Zoltar went 4-2 (using 4.0 points as the advice threshold). Zoltar was correct in recommending Vegas underdogs Lions (thanks to a late spread change plus the 49ers giving up 16 points in the final few minutes), Texans, Cardinals, Raiders. Zoltar was wrong in recommending underdogs Colts and Giants.

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn’t useful except for parlay betting. In week #1, just predicting the winning team, Zoltar went 8-8 which isn’t very good but is typical of the first few weeks of the season. In week #1, just predicting the winning team, Vegas — “the wisdom of the crowd” — went 9-7.

Zoltar sometimes predicts a 0-point margin of victory. There are four such games in week #2. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.



Zoltar uses machine learning rather than a crystal ball. Left: “The Wizard of Oz” (1939). Center: “Labyrinth” (1986). Right: “Harry Potter and the Prisoner of Azkaban” (2004).


Determining If Two Sentences Are Paraphrases Of Each Other Using Hugging Face

Deep neural systems based on Transformer Architecture (TA) have revolutionized the field of natural language processing (NLP). Unfortunately, TA systems are insanely complex, meaning that implementing a TA system from scratch is not feasible, and implementing TA using a low-level library like PyTorch or Keras or TensorFlow is only barely feasible.

The Hugging Face library (I hate that name . .) is a high-level code library (but I like the library . .) that makes writing TA systems simple — with the downside that customizing a TA system built on Hugging Face can be very difficult.

I recently started work on a speculative project that will use a TA system. In our first team meeting, we decided that our initial approach will be to start with a Hugging Face model and then attempt to customize it, rather than try to build the system using PyTorch or Keras.

Even though I’ve been a software developer for many years, I forgot how to tackle the project. I incorrectly started by looking at all the Hugging Face technical documentation. I quickly got overwhelmed. After taking a short break, I remembered how I learn technology topics — from specific to general. In other words, I learn best by looking at many small, concrete examples. Over time, I learn the big picture. This is in sharp contrast to how some people learn — from general to specific. Those people start by learning the big picture and then learn how to construct concrete examples.

So, my plan is to look at one or two concrete examples of Hugging Face code every day or so. I know from previous experience that it’s important to have buffer time between explorations. My brain can only accept so much technical information until the effect of psychological interference starts — new information bounces off and interferes with old information.

My first example was a paraphrase analysis. Briefly, two sentences are paraphrases if they essentially mean the same thing. I set up two sentences:

phrase_0 = "Machine Learning (ML) makes predictions from data"
phrase_1 = "ML uses data to compute a prediction."

Although the concept of paraphrases is somewhat subjective, most people would say the two sentences are in fact paraphrases of each other. The demo program is remarkably short because the Hugging Face library is so high-level. The demo emitted two associated pseudo-probabilities: the probability that the sentences are not paraphrases, and the probability that the sentences are paraphrases. The pseudo-probability values were [0.058, 0.942] so the model strongly believed the two sentences are in fact paraphrases.

Next step: another concrete Hugging Face example. And then another, and another until the big picture gels in my head.



According to the Google Image Search Similarity tool, these three portraits are artistic paraphrases of each other. Left: Russian woman. Center: Spanish woman. Right: Italian woman.


# paraphrase_test.py

from transformers import AutoTokenizer, \
  AutoModelForSequenceClassification
import torch

print("\nBegin HugFace paraphrase example ")

toker = \
  AutoTokenizer.from_pretrained \
  ("bert-base-cased-finetuned-mrpc")
model = \
  AutoModelForSequenceClassification.from_pretrained \
  ("bert-base-cased-finetuned-mrpc")

phrase_0 = "Machine Learning (ML) makes predictions from data"
phrase_1 = "ML uses data to compute a prediction."
print("\nFirst phrase: ")
print(phrase_0)
print("\nSecond phrase: ")
print(phrase_1)

phrases = toker(phrase_0, phrase_1, return_tensors="pt")
# print(type(phrases))
# 'transformers.tokenization_utils_base.BatchEncoding'
# derived from a Dictionary

with torch.no_grad():
  result_logits = model(**phrases).logits
result_probs = torch.softmax(result_logits, dim=1).numpy()

print("\nPseudo-probabilities of not-a-para, is-a-para: ")
print(result_probs)

print("\nEnd HugFace example ")

A Simplified Approach for Ordinal Classification

In a standard classification problem, the goal is to predict a class label. For example, in the Iris Dataset problem, the goal is to predict a species of flower: 0 = “setosa”, 1 = “versicolor”, 2 = “virginica”. Here the class labels are just labels, without any meaning attached to their order. In an ordinal classification problem (also called ordinal regression), the class labels have order. For example, you might want to predict the median price of a house in one of 506 towns, where price can be 0 = very low, 1 = low, 2 = medium, 3 = high, 4 = very high. For an ordinal classification problem, you could just use standard classification, but that approach doesn’t take advantage of the ordering information in the training data.

I coded up a demo of a simple technique using the PyTorch code library. The same technique can be used with Keras/TensorFlow too.

I used a modified version of the Boston Housing dataset. There are 506 data items. Each item is a town near Boston. There are 13 predictor variables — crime rate in town, tax rate in town, proportion of Black residents in town, and so on. The original Boston dataset contains the median price of a house in each town, divided by $1,000 — like 35.00 for $35,000 (the data is from the 1970s when house prices were low). To convert the data to an ordinal classification problem, I mapped the house prices like so:

       price          class  count
[$0      to $10,000)    0      24
[$10,000 to $20,000)    1     191
[$20,000 to $30,000)    2     207
[$30,000 to $40,000)    3      53
[$40,000 to $50,000]    4      31
                              ---
                              506

I normalized the numeric predictor values by dividing by a constant so that each normalized value is between -1.0 and +1.0. I encoded the single Boolean predictor value (does town border the Charles River) as -1 (no), +1 (yes).

The technique I used for ordinal classification is something I invented myself, at least as far as I know. I’ve never seen the technique I used anywhere else, but it’s not too complicated and so it could exist under an obscure name of some sort.

For the modified Boston Housing dataset there are k = 5 classes. The class target values in the training data are (0, 1, 2, 3, 4). My neural network system outputs a single numeric value between 0.0 and 1.0 — for example 0.2345. The class target values of (0, 1, 2, 3, 4) generate associated floating point sub-targets of (0.1, 0.3, 0.5, 0.7, 0.9). When I read the data into memory as a PyTorch Dataset object, I map each ordinal class label to the associated floating point target. Then I use standard MSELoss() to train the network.

Suppose a data item has class label = 3 (high price). The target value for that item is stored as 0.7. The computed predicted price will be something like 0.66 (close to target, so low MSE error and a correct prediction) or maybe 0.23 (far from target, so high MSE error and a wrong prediction). With this scheme, the ordering information is used.
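
After training, a computed output value in [0.0, 1.0] has to be mapped back to one of the ordinal class labels. A minimal sketch of the idea (a helper I’d write separately — it’s not shown in the demo code) is to pick the class whose sub-target is closest to the computed output:

import numpy as np

def output_to_class(y, k=5):
  # sub-targets for k classes: (0.1, 0.3, 0.5, 0.7, 0.9) when k = 5
  targets = (2 * np.arange(k) + 1) / (2 * k)
  return int(np.argmin(np.abs(targets - y)))

print(output_to_class(0.66))  # 3
print(output_to_class(0.23))  # 1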

For implementation, most of the work is done inside the Dataset object:

class BostonDataset(T.utils.data.Dataset):
  # features are in cols [0,12], median price as int in [13]

  def __init__(self, src_file, k):
    # k = number of classes (the class-to-target
    # mapping below is hard-coded for k = 5)
    tmp_x = np.loadtxt(src_file, usecols=range(0,13),
      delimiter="\t", comments="#", dtype=np.float32)
    tmp_y = np.loadtxt(src_file, usecols=13,
      delimiter="\t", comments="#", dtype=np.int64)

    n = len(tmp_y)
    float_targets = np.zeros(n, dtype=np.float32)  # 1D

    for i in range(n):  # hard-coded is easy to understand
      if tmp_y[i] == 0: float_targets[i] = 0.1
      elif tmp_y[i] == 1: float_targets[i] = 0.3
      elif tmp_y[i] == 2: float_targets[i] = 0.5
      elif tmp_y[i] == 3: float_targets[i] = 0.7
      elif tmp_y[i] == 4: float_targets[i] = 0.9
      else: print("Fatal logic error ")

    float_targets = np.reshape(float_targets, (-1,1))  # 2D

    self.x_data = \
      T.tensor(tmp_x, dtype=T.float32).to(device) 
    self.y_data = \
      T.tensor(float_targets, dtype=T.float32).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]  # all cols
    price = self.y_data[idx]  # all cols
    return (preds, price)     # tuple of two matrices

There are a few minor, but very tricky details. They’d take much too long to explain in a blog post, so I’ll just say that if you’re interested, examine the code very carefully.



I don’t think it’s possible to assign a strictly numeric value to art. Here are two clever illustrations by artist Casimir Lee. I like the bright colors and combination of 1920s art deco style with 1960s psychedelic style.


Code below. Long.


Example of Computing Kullback-Leibler Divergence for Continuous Distributions

In this post, I present an example of estimating the Kullback-Leibler (KL) divergence between two continuous distributions using the Monte Carlo technique. Whoa! Just stating the problem has a massive amount of information.

The KL divergence is the key part of the ELBO (evidence lower bound) loss function, which in turn is the key to the creation of Variational Autoencoder (VAE) neural systems that generate synthetic data.

You can tell right away that there are dozens of interrelated topics here — individually they are moderately complicated, but when you combine all of them, things get fantastically complicated. This is the general challenge in all machine learning techniques.

A good way to think about what a continuous distribution is, in the context of this post, is to focus on one specific example: the Normal (aka Gaussian) distribution is characterized completely by a mean (mu, average value) and a standard deviation (sd, sigma, measure of spread). If you graph the Normal distribution you get a bell-shaped curve with X from -infinity to +infinity — any X value is possible (hence “continuous”) but if you pick an X randomly, it will almost always be within 4 * sigma of the mean. The curve is called the probability density function (PDF). The height of the PDF curve at a value x is not the probability of x but is a measure of how likely x is.

The Kullback-Leibler divergence is a number that describes how different two distributions (P and Q) are. If KL(P, Q) = 0, the two distributions are the same. The larger the value of KL(P, Q), the greater the difference between P and Q. Note that because of the way KL is defined mathematically, KL(P, Q) is not necessarily equal to KL(Q, P) — KL is not symmetric. KL can be defined for continuous distributions (the topic of this post) or for discrete distributions (a different topic).
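
For reference, the math definition for continuous distributions is KL(P, Q) = the integral over all x of p(x) * log(p(x) / q(x)) dx, where p and q are the PDF functions of P and Q. In all but a few special cases the integral can’t be computed analytically, which is why an estimation technique such as Monte Carlo is needed.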

The Monte Carlo technique loosely means “an estimate computed by simulation”. Suppose you wanted to estimate the probability of getting Heads if you flip a specific coin. You could flip the coin 100 times. If you got 53 Heads (and 47 Tails) your estimate of pr(Heads) = 53/100 = 0.53.

The demo has three parts. The first part sets up P and Q and illustrates key concepts. The second part computes a KL(P, Q) divergence using the Monte Carlo technique. The third part computes KL for the same P and Q using a magic math equation.

The first part sets up P as a Normal distribution with mean = 0.0 and standard deviation = 1.0 — this is the Standard Normal distribution. A second distribution Q with mean = 3.0 and standard deviation = 1.4 is created. Next a value z = 2.5 is set:

p_mu = T.tensor([0.0], dtype=T.float32)
p_sd = T.tensor([1.0], dtype=T.float32)

q_mu = T.tensor([3.0], dtype=T.float32)
q_sd = T.tensor([1.4], dtype=T.float32)

P = T.distributions.Normal(p_mu, p_sd)
Q = T.distributions.Normal(q_mu, q_sd)

print("\nSetting z = 2.5 ")
z = T.tensor([2.5], dtype=T.float32)

I use PyTorch tensors but the exact same ideas work with NumPy arrays too.

Next, the demo computes two PDF values:

log_p_df_z = P.log_prob(z)
p_df_z = T.exp(log_p_df_z).item()

log_q_df_z = Q.log_prob(z)
q_df_z = T.exp(log_q_df_z).item()

print("Semi-likelihood z comes from P = %0.4f " % p_df_z)
print("Semi-likelihood z comes from Q = %0.4f " % q_df_z)

The P.log_prob(z) function gives the log of the value of the P = N(0,1) distribution PDF at z = 2.5. The log() is used because 1.) raw PDF values are often extremely small, and 2.) log() values are much easier to work with. The minor downside is that log() values are negative. I use T.exp() to convert the values from log to raw. I write “semi-likelihood” to emphasize the values aren’t true probabilities.

The likelihood value that z = 2.5 came from P = N(0,1) is 0.0175 and the likelihood z came from Q = N(3, 1.4) is 0.2674. In other words, it’s more likely z came from Q than from P. This makes sense because z is much closer to the mean of Q than it is to the mean of P.

The second part of the demo uses the concepts to estimate KL(P,Q) using the Monte Carlo technique:

n_samples = 100

print("\nEstimating KL(P,Q) using %d samples " % n_samples)
sum = 0.0
for i in range(n_samples):
  z = Q.rsample()
  kl = Q.log_prob(z) - P.log_prob(z)  # magic math
  sum += kl.item()

KL = sum / n_samples
print("Estimated Monte Carlo KL = %0.4f " % KL)

The code iterates 100 times. In each iteration a random value z is drawn from the Q distribution using the rsample() method. The PDF likelihood values that z comes from P and from Q are computed (in log form). The magic math is that each estimated KL value is the difference between the two log-PDF values. Note that because the z values are sampled from Q, the technique estimates the divergence from Q to P (usually written KL(Q, P)), which is consistent with the special case equation in the next section. The 100 estimated KL values are summed and then the average is computed. The result is 4.3975 which indicates the two distributions aren’t very close.

The last part of the demo uses some fancy math to compute the KL divergence. It turns out that there’s a special case when P is standard Normal (mean = 0 and sd = 1). In this special case, the divergence can be computed directly like so:

print("\nEstimating KL(P,Q) using special N(0,1) math ")
q_logvar = T.log(q_sd.pow(2))
KL = -0.5 * T.sum(1 + q_logvar - q_mu.pow(2) - q_logvar.exp())
print("Special case computed KL = %0.4f " % KL)

This is not at all obvious and is one of the many “magic math” equations in machine learning. Because of this very nice property, systems that need to compute KL for two distributions often assume one or both are N(0,1) even when they might not be — the assumption makes calculations much, much easier.
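
As a sanity check (my own addition, not part of the demo), there’s also a general closed form equation for the KL divergence between any two univariate Normal distributions, which can be used to verify the Monte Carlo estimate:

import math

def kl_normal(mu1, sd1, mu2, sd2):
  # exact KL( N(mu1, sd1) || N(mu2, sd2) )
  return math.log(sd2 / sd1) + \
    (sd1**2 + (mu1 - mu2)**2) / (2 * sd2**2) - 0.5

# divergence from Q = N(3, 1.4) to P = N(0, 1)
print("%0.4f " % kl_normal(3.0, 1.4, 0.0, 1.0))  # 4.6435

The 4.6435 result agrees with the special case N(0,1) equation above and is reasonably close to the 100-sample Monte Carlo estimate of 4.3975.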

In machine learning, computing KL divergence between two distributions is most often seen in variational autoencoders, but the technique is used in several other problem scenarios too, such as Bayesian neural networks where each network weight is a Normal distribution instead of a single fixed value.



I’ve always thought that an effective travel poster is one that finds the right distance between reality and a happy fantasy ideal. Here are three old Hawaii travel posters like the ones that contributed to a huge increase in Hawaii tourism starting in the late 1950s. Left: By artist Bern Hill. Center: By artist Paul Lawler. Right: By artist Joseph Feher.


Complete demo code:

# kl_estimate_demo.py

# import numpy as np
import torch as T

print("\nBegin KL continuous Monte Carlo demo ")
T.manual_seed(0)

print("\nCreating P = (N, 0, 1) and Q = (N, 3, 1.4) ")
p_mu = T.tensor([0.0], dtype=T.float32)
p_sd = T.tensor([1.0], dtype=T.float32)

q_mu = T.tensor([3.0], dtype=T.float32)
q_sd = T.tensor([1.4], dtype=T.float32)

P = T.distributions.Normal(p_mu, p_sd)
Q = T.distributions.Normal(q_mu, q_sd)

print("\nSetting z = 2.5 ")
z = T.tensor([2.5], dtype=T.float32)

log_p_df_z = P.log_prob(z)
p_df_z = T.exp(log_p_df_z).item()

log_q_df_z = Q.log_prob(z)
q_df_z = T.exp(log_q_df_z).item()

print("Semi-likelihood z comes from P = %0.4f " % p_df_z) # .0175
print("Semi-likelihood z comes from Q = %0.4f " % q_df_z) # .2674

n_samples = 100

print("\nEstimating KL(P,Q) using %d samples " % n_samples)
sum = 0.0
for i in range(n_samples):
  z = Q.rsample()
  kl = Q.log_prob(z) - P.log_prob(z)  # magic math
  sum += kl.item()

KL = sum / n_samples
print("Estimated Monte Carlo KL = %0.4f " % KL)

print("\nEstimating KL(P,Q) using special N(0,1) math ")
# KLD = -0.5 * sum(1 + log(sigma^2) - u^2 - sigma^2)
q_logvar = T.log(q_sd.pow(2))
KL = -0.5 * T.sum(1 + q_logvar - q_mu.pow(2) - q_logvar.exp())
print("Special case computed KL = %0.4f " % KL)

print("\nEnd demo ")

The Wasserstein Distance Using C#

The Wasserstein distance has many different variations. In its simplest form the Wasserstein distance function measures the distance between two discrete probability distributions. For example, if:

 double[] P = new double[]
   { 0.6, 0.1, 0.1, 0.1, 0.1 };
 double[] Q1 = new double[]
   { 0.1, 0.1, 0.6, 0.1, 0.1 };
 double[] Q2 = new double[]
   { 0.1, 0.1, 0.1, 0.1, 0.6 }; 

Wasserstein(P, Q1) = 1.00
Wasserstein(P, Q2) = 2.00

Conceptually, if P is considered to be piles of dirt and Q is considered to be holes, then Wasserstein(P, Q) is the minimum amount of work (amount of dirt times distance moved) needed to transfer all dirt to the holes. For example, Wasserstein(P, Q1) = 1.00 because the 0.5 excess dirt at cell 0 must move two cells to fill the 0.5 deficit at cell 2, and 0.5 * 2 = 1.00. Or you can think of Wasserstein as the effort required to transform P into Q.

I use the Python language for most of my machine learning projects, but sometimes I use the C# language. I coded up a quick demo of a highly simplified Wasserstein distance function:

using System;
namespace Wasserstein
{
  class Program
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin demo \n");

      double[] P = new double[]
        { 0.6, 0.1, 0.1, 0.1, 0.1 };
      double[] Q1 = new double[]
        { 0.1, 0.1, 0.6, 0.1, 0.1 };
      double[] Q2 = new double[]
        { 0.1, 0.1, 0.1, 0.1, 0.6 };

      double wass_p_q1 = MyWasserstein(P, Q1);
      double wass_p_q2 = MyWasserstein(P, Q2);

      Console.WriteLine("Wasserstein(P, Q1) = " +
        wass_p_q1.ToString("F4"));
      Console.WriteLine("Wasserstein(P, Q2) = " +
        wass_p_q2.ToString("F4"));

      Console.WriteLine("\nEnd demo ");
      Console.ReadLine();
    }  // Main

    static int FirstNonZero(double[] vec)
    {
      int dim = vec.Length;
      for (int i = 0; i < dim; ++i)
        if (vec[i] > 0.0)
          return i;
      return -1;
    }

    static double MoveDirt(double[] dirt, int di,
      double[] holes, int hi)
    {
      double flow = 0.0;
      int dist = 0;
      if (dirt[di] <= holes[hi])  // use all the dirt
      {
        flow = dirt[di];
        dirt[di] = 0.0;
        holes[hi] -= flow;
      }
      else  // use just part of the dirt
      {
        flow = holes[hi];
        dirt[di] -= flow;
        holes[hi] = 0.0;
      }
      dist = Math.Abs(di - hi);
      return flow * dist;
    }

    static double MyWasserstein(double[] p, double[] q)
    {
      double[] dirt = (double[])p.Clone();
      double[] holes = (double[])q.Clone();
      double totalWork = 0.0;
      while (true)
      {
        int fromIdx = FirstNonZero(dirt);
        int toIdx = FirstNonZero(holes);
        if (fromIdx == -1 || toIdx == -1)
          break;
        double work = MoveDirt(dirt, fromIdx,
          holes, toIdx);
        totalWork += work;
      }
      return totalWork;
    }
  }  // Program
}  // ns

There are many complex variations of Wasserstein. My C# Wasserstein demo works only with discrete probability distributions where each data item is a single-valued probability.

Python is the programming language of choice for most machine learning systems. But C# is often the language of choice for business-related systems so it’s nice to be able to implement ML functions in C# when needed, rather than try to glue a Python ML system to a C# business system.



Ray Cummings (1887-1957) was one of the early pioneers of science fiction. Decades ago, the conceptual distance between science fiction and reality was much greater than today. For example, in the early 1900s when the first airplanes could barely fly, stories about space travel must have seemed impossible. Today, sending a probe to Mars is almost commonplace.


NFL 2021 Week 1 Predictions – Zoltar Likes Six Underdogs

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #1 of the 2021 season. These predictions are tentative, in the sense that it usually takes Zoltar about three weeks to hit his stride.

Zoltar:  buccaneers  by    6  dog =     cowboys    Vegas:  buccaneers  by  7.5
Zoltar:     falcons  by    2  dog =      eagles    Vegas:     falcons  by  3.5
Zoltar:       bills  by    4  dog =    steelers    Vegas:       bills  by  6.5
Zoltar:    panthers  by    6  dog =        jets    Vegas:    panthers  by    5
Zoltar:     vikings  by    0  dog =     bengals    Vegas:     vikings  by    3
Zoltar:       colts  by    2  dog =    seahawks    Vegas:    seahawks  by  2.5
Zoltar:       lions  by    2  dog = fortyniners    Vegas: fortyniners  by  7.5
Zoltar:      texans  by    6  dog =     jaguars    Vegas:     jaguars  by    3
Zoltar:      titans  by    6  dog =   cardinals    Vegas:      titans  by   23
Zoltar:    redskins  by    3  dog =    chargers    Vegas:    chargers  by    1
Zoltar:      chiefs  by    6  dog =      browns    Vegas:      chiefs  by    6
Zoltar:      saints  by    2  dog =     packers    Vegas:      saints  by  4.5
Zoltar:    dolphins  by    0  dog =    patriots    Vegas:    patriots  by    3
Zoltar:      giants  by    4  dog =     broncos    Vegas:     broncos  by  2.5
Zoltar:        rams  by    5  dog =       bears    Vegas:        rams  by  7.5
Zoltar:      ravens  by    0  dog =     raiders    Vegas:      ravens  by  4.5

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I use 3.0 points difference. For this first week of the season, I went a bit more conservative and used 4.0 points difference or more as the advice criterion.

At the beginning of the season, because of Zoltar’s initialization and other algorithms, Zoltar is strongly biased towards Vegas underdogs.

1. Zoltar likes Vegas underdog Colts against the Seahawks.
2. Zoltar likes Vegas underdog Lions against the 49ers.
3. Zoltar likes Vegas underdog Texans against the Jaguars.
4. Zoltar likes Vegas underdog Cardinals against the Titans.
5. Zoltar likes Vegas underdog Giants against the Broncos.
6. Zoltar likes Vegas underdog Raiders against the Ravens.

For example, a bet on the Vegas underdog Colts against the Seahawks will pay off if the Colts win by any score, or if the favored Seahawks win but by less than 2.5 points (in other words, win by 2 points or fewer, in other words, win by 1 point or 2 points).

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn’t useful except for parlay betting. I’ll post results next week.

Zoltar sometimes predicts a 0-point margin of victory. There are three such games in week #1 (Vikings-Bengals, Dolphins-Patriots, Ravens-Raiders). In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win (Bengals, Patriots, Raiders). After that, Zoltar uses his algorithms to pick a winner.



My system is named after the Zoltar fortune teller machine that you can find in arcades. Arcade Zoltar is named after the Zoltar machine that appeared in the 1988 movie “Big”. I suspect that movie Zoltar was named after the 1960s arcade fortune teller machine Zoltan.


A Recap of the 2021 National Homeland Security Conference

I gave a short talk at the 2021 National Homeland Security Conference. See https://www.nationalhomelandsecurity.org/. The event was held from August 30 through September 2.

The event Web site states, “The National Homeland Security Conference brings together professionals in Homeland Security, Law Enforcement, Fire and Emergency Management. They include officials in federal agencies, nonprofit agencies, business owners, universities and decision makers to learn about emerging trends in homeland security.”

I estimate the event had about 2,000 attendees but I could be way off on that guess. Because of the way the conference was laid out, it was difficult to see how many people were there at any given time. Most of the attendees I talked to worked for State and Federal government agencies, but there were people from industry and academia too.

My talk was titled “AI and ML for Cyber Threat Prevention”. I described different techniques for anomaly detection, such as deep neural autoencoder reconstruction error.

My talk was an out-of-band presentation and was set up by one of the sponsoring companies for a private audience during a break from the regular session times. The conference focused mostly on things like planning and logistics, rather than technical topics like mine, and so my talk wouldn’t have had broad appeal. But the audience members who did show up were very enthusiastic and asked good questions.

One of the main reasons I speak at conferences is to gain insights into trends in the business and government communities. This information helps me prioritize the projects I do at work. My key takeaway from the 2021 NHSC event was that many government organizations seem to be mostly reactive — that is, planning what to do when something bad happens — rather than proactive — trying to prevent disasters from happening in the first place.



Seconds before disaster.
