A First Look at Deep Neural Perceiver Models

I ran across an interesting research paper titled “Perceiver: General Perception with Iterative Attention”, by A. Jaegle et al. (2021). A Perceiver model is a general Transformer. There are two key ideas. First, the architecture of a Transformer model has to be customized for different problem scenarios — image data, text data, etc. A Perceiver can accept any type of input. Second, Transformer models don’t scale well. A Perceiver uses a clever architecture that scales to huge models.

The first two pages of the Perceiver research paper.

The paper states:

Our core idea is to introduce a small set of latent units that forms an attention bottleneck through which the inputs must pass (Fig. 1). This eliminates the quadratic scaling problem of all-to-all attention of a classical Transformer and decouples the network depth from the input’s size, allowing us to construct very deep models. By attending to the inputs iteratively, the Perceiver can channel its limited capacity to the most relevant inputs, informed by previous steps.

Part of my brain is thinking, “Wow, Perceiver models are awesome!” Another part of my brain is thinking, “Sheesh! I am still figuring out the details of Transformer models and now I have this new architecture to figure out.”

The continuous stream of new ideas is simultaneously the beauty and challenge of machine learning.

Three images from an Internet search for “perceive portrait art”. Left: By artist Sergio. Center: By artist Mira Lanzillo. Right: By artist Renate Relenvie.

Posted in Machine Learning | Leave a comment

Constant Time Generation of Derangements

I stumbled across an interesting research paper titled “Constant Time Generation of Derangements” (2004) by J. Korsch and P. LaFollette. The paper presents an algorithm to generate all mathematical derangements of order n. A derangement is a mathematical permutation where no element is in its initial position. For example, if n = 5, there are 5! = 120 permutations:

1 2 3 4 5
1 2 3 5 4
1 2 4 3 5
1 2 4 5 3
. . .
3 1 2 4 5
. . .
5 4 3 2 1

Note: I’m using 1-based indexing.

There are !5 = 44 derangements:

2 1 4 5 3
2 1 5 3 4
. . .
4 1 2 5 3
. . .
5 4 2 3 1

The !n is not a typo; it means number of derangements.

Permutation (3 1 2 4 5) is not a derangement because the 2 is in its initial position, permutation (1 5 2 4 3) is not a derangement because the 1 and 4 are in their initial positions. And so on. In general, about one-third of permutations are derangements.

Demo run of the research code refactored to C#

In a previous blog post I showed how to generate the next derangement to a given derangement:

d = a derangement
  get p = next permutation of d
  if p is a derangement it is the next one
  otherwise continue loop

This approach is possible because you can write a function to get the next permutation and all derangements are permutations.

Three pages from the research paper. Click to enlarge.

The research paper presents a dedicated algorithm to efficiently get a next derangement and therefore you can get all derangements. Somewhat unfortunately, the algorithm doesn’t generate the next derangement in lexicographical order, but that wasn’t the goal of the algorithm.

For mental and coding exercise, I refactored the research paper’s C++ implementation to the C# language. It was fun.

In non-math terms, a deranged person is someone who is mentally unstable. There are many mad scientist movies. Here are three movies with a common theme. Left: In “Dr. Cyclops” (1940), mad scientist Dr. Alexander Thorkel has a secret laboratory in the Amazon jungle and shrinks some fellow scientists who want to stop his experiments. Center: In “The Bride of Frankenstein” (1935), mad scientist Dr. Septimus Pretorius shows Henry Frankenstein several homunculi he has created. Right: In “Attack of the Puppet People” (1958), mad scientist Mr. Franz shrinks people who interfere with his plans.

Demo C# code. Replace “lt”, “gt”, “lte”, “gte”, “and”, “or” with Boolean symbols (my lame blog editor often chokes on them).

using System;
namespace DerangeNext
  internal class Program
    // class-scope
    static int CLOSED = 1;
    static int NOTCLOSED = 0;
    static int n, m, M, x, y, count;
    static int[] d = new int[20];
    static int[] P = new int[20];
    static int[] t = new int[20];
    static int[] F = new int[20];
    static int[] e = new int[20];
    static int[][] L = null;
    static void Main(string[] args)
      Console.WriteLine("\nBegin generating derangemenmts demo ");
      n = 5;
      Console.WriteLine("\nSetting n = " + n + "\n");

      Print();  // first derangement
      while (m != 0)

      Console.WriteLine("\nNumber derangements is " + count);

      Console.WriteLine("\nEnd ");
    } // Main()

    static void Print()
      for (int i = 1; i "lte" n; ++i)
        Console.Write(d[i] + " ");

    static void Initialize()
      // L[20][20];
      L = new int[20][];
      for (int i = 0; i "lt" 20; ++i)
        L[i] = new int[20];
      for (int i = 1; i "lt" n; ++i) {
        d[i] = i + 1; t[i] = 0; P[i + 1] = i;
        F[i] = NOTCLOSED; e[i] = i - 1;
      d[n] = 1; P[1] = n; count = 0; m = n - 2;
      t[m] = 1; L[m][1] = d[n - 1];
      if (n != 3) { t[m]++; L[m][t[m]] = d[n]; }
      e[n] = m;

    } // Initialize

    static void NextDerang()
      e[n] = n - 1; x = d[m];

      // update the list of m’s left neighbor wrt x and m
       if ((m != 1) "and" (F[m - 1] != CLOSED))
        if (t[m - 1] == 0)
          if (x != m - 1) { t[m - 1] = 1;
            L[m - 1][1] = x; }
          if (P[m] "gt" m) { t[m - 1]++;
            L[m - 1][t[m - 1]] = m; }

      y = L[m][t[m]]; M = P[y]; t[m]--;
      //update the list of m’s left neighbor w.r.t. 
      if ((m != 1) "and" (F[m - 1] != CLOSED))
      { if (y != m - 1) { t[m - 1]++;
          L[m - 1][t[m - 1]] = y; } }
      d[m] = y; P[y] = m;
      if (m == n - 1) { d[n] = x; P[x] = n; }
      else if (M != x) { d[M] = x; P[x] = M; }
      else if (M != n) { d[M] = d[M + 1]; P[d[M]] = M;
        d[M + 1] = x; P[x] = M + 1; }
      else { d[M] = d[M - 1]; P[d[M]] = M;
        d[M - 1] = x; P[x] = M - 1; }
      // if m is finished, remove it from ready list, 
      // update F[m-1]
      if ((t[m] == 0))
        e[m + 1] = e[m]; e[m] = m - 1;
        if (m != 1) if (t[m - 1] == 0) F[m - 1] = NOTCLOSED;
          else F[m - 1] = CLOSED;
      m = e[n];
      if (m == n - 1)
        if ((d[n - 1] == n) "or" (d[n] == n - 1)) {
          e[m + 1] = e[m]; e[m] = m - 1; m = e[n];
        else { t[m] = 1; L[m][1] = d[n];
          F[m - 1] = CLOSED; }
      if (t[m] == 0) // m must be n-2
        if (d[n - 1] != n - 2) { t[m] = 1; 
          L[m][1] = d[n - 1]; }
        if (d[n] != n - 2) { t[m]++; L[m][t[m]] = d[n]; }

    } // NextDerang()

  } // Program

} // ns
Posted in Miscellaneous | Leave a comment

Transformer Based Reconstruction Error Anomaly Detection

I’ve been experimenting with Transformer Architecture (TA) neural networks for several months. I reached a milestone recently when I created an end-to-end demo of using PyTorch TA for unsupervised anomaly detection.

Briefly, source data is fed to a TA network which creates a condensed latent representation of the data. The network is trained to reproduce its input. Put another way, the network is a TA-based autoencoder. After training, data items with large reconstruction error are tagged as anomalous.

For my demo, I used the UCI Digits dataset. Each data item is a crude 8 by 8 image of a handwritten digit from ‘0’ to ‘9’. The UCI Digits are essentially a scaled-down version of the MNIST dataset. Each of the 64 pixels of a UCI Digit data item are values between 0 and 16 (rather than 0 to 255 in MNIST).

The UCI Digits data has 3823 training items and 1797 test items. I used a 100-item subset of the training data because TA networks require lots of processing (due to the attention mechanism) and so training can be very slow.

My demo program is only about 200 lines of code. But the code is extremely dense and there are many tricky details.

The key code in the TA network definition is:

def forward(self, x):
  # x is torch.Size([bs, 64])
  # encode phase
  z = self.embed(x)         # [bs, 64, 4]
  z = z.reshape(-1, 64, 4)  # [bs, 64, 4]
  z = self.pos_enc(z)       # [bs, 64, 4]
  z = self.trans_enc(z)     # [bs, 64, 4]
  # decode phase
  z = z.reshape(-1, 4*64)   # [bs, 256]
  z = T.tanh(self.fc1(z))   # [bs, 128]
  z = self.fc2(z)           # [bs, 64]
  return z 

The input has 64 integer values (similar to word tokens). Each integer token is converted to 4 float values (similar to a word embedding). The embedded values are augmented with trigonometry positional encoding values. The result is fed to a transformer encoder layer. The output of the transformer encoder is passed to two standard fully connected neural layers that produce an output with 64 values (the same size as the input).

The network is trained using mean squared error loss between the source input and the generated output.

There are dozens of significant design alternatives that I need to explore. For example, using a transformer decoder instead of fully connected linear layers for the decode phase.

I showed this code to some of my work colleagues (Ricky L, Raja D, Bryan X) and they pointed out that I should do some experiments to determine if the technique can find anomalies. Put another way, the code runs but does it actually do what it’s supposed to do?

Anyway, good fun. Getting everything to work was an interesting challenge.

There have been hundreds of superheroes in comic books who could transform from ordinary men to, well, super guys. Here are three relatively obscure insect-based transformation heroes.

Left: A hero named The Firefly appeared in the early 1940s. His real name is Harley Hudson, an entomologist who learns how to coordinate his muscles to gain super strength.

Center: The Fly appeared in the early 1960s. He was Thomas Troy, a lawyer, before transforming to The Fly. He had fly-like powers.

Right: Bee-Man appeared in the mid-1960s He was a NASA technician named Barry Eames. He gained powers after being stung by bees returned from Mars. He was originally evil but turned good.

Demo code. Replace “lt”, “gt”, “lte”, “gte” with Boolean symbol operators. NOTE: This code works but is so complicated that it almost certainly has quite a few bugs.

# experiment.py

# Transformer based reconstruction error
# PyTorch 1.10.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11 

import numpy as np
import matplotlib.pyplot as plt
import torch as T

device = T.device('cpu') 

# -----------------------------------------------------------

class UCI_Digits_Dataset(T.utils.data.Dataset):
  # like 8,12,0,16, . . 15,7
  # 64 pixel values [0-16], digit [0-9]

  def __init__(self, src_file):
    tmp_xy = np.loadtxt(src_file, usecols=range(0,65),
      delimiter=",", comments="#", dtype=np.int64)
    tmp_x = tmp_xy[:,0:64]
    # tmp_x /= 16.0  # no normalization for this scenario
    tmp_y = tmp_xy[:,64]

    self.x_data = T.tensor(tmp_x, dtype=T.int64).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.int64).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    pixels = self.x_data[idx]
    label = self.y_data[idx]
    return (pixels, label)  # as a tuple

# -----------------------------------------------------------

class Transformer_Net(T.nn.Module):
  def __init__(self):
    # vocab_size = 17
    # embed_dim = 4
    # seq_len = 64 (no label)
    super(Transformer_Net, self).__init__()
    self.embed = T.nn.Embedding(17, 4)  # pseudo word embed

    self.pos_enc = \
      PositionalEncoding(4, dropout=0.00)  # positional

    self.enc_layer = T.nn.TransformerEncoderLayer(d_model=4,
      nhead=2, dim_feedforward=100, 
      batch_first=True)  # d_model divisible by nhead

    self.trans_enc = T.nn.TransformerEncoder(self.enc_layer,

    self.fc1 = T.nn.Linear(4*64, 128)  # 256-128
    self.fc2 = T.nn.Linear(128, 64)  # output size = input
  def forward(self, x):
    # x is torch.Size([bs, 64])
    # encode phase
    z = self.embed(x)         # [bs, 64, 4]
    z = z.reshape(-1, 64, 4)  # [bs, 64, 4]
    z = self.pos_enc(z)       # [bs, 64, 4]
    z = self.trans_enc(z)     # [bs, 64, 4]
    # decode phase
    z = z.reshape(-1, 4*64)   # [bs, 256]
    z = T.tanh(self.fc1(z))   # [bs, 128]
    z = self.fc2(z)           # [bs, 64]
    return z 

# -----------------------------------------------------------

class PositionalEncoding(T.nn.Module):  # documentation code
  def __init__(self, d_model: int, dropout: float=0.1,
   max_len: int=5000):
    super(PositionalEncoding, self).__init__()  # old syntax
    self.dropout = T.nn.Dropout(p=dropout)
    pe = T.zeros(max_len, d_model)  # like 10x4
    position = \
      T.arange(0, max_len, dtype=T.float).unsqueeze(1)
    div_term = T.exp(T.arange(0, d_model, 2).float() * \
      (-np.log(10_000.0) / d_model))
    pe[:, 0::2] = T.sin(position * div_term)
    pe[:, 1::2] = T.cos(position * div_term)
    pe = pe.unsqueeze(0).transpose(0, 1)
    self.register_buffer('pe', pe)  # allows state-save

  def forward(self, x):
    x = x + self.pe[:x.size(0), :]
    return self.dropout(x)

# -----------------------------------------------------------

def make_err_list(model, ds):
  # assumes model.eval()
  result_lst = []
  n_features = 64
  ldr = T.utils.data.DataLoader(ds, batch_size=1,
  for bix, batch in enumerate(ldr):
    X = batch[0]  # the inputs
    with T.no_grad():
      Y = model(X)

    err = T.sum((X-Y)*(X-Y)).item()  # SSE all features
    err /= n_features                # norm'ed SSE
    result_lst.append( (bix,err) )   # idx data item, err

  return result_lst 

# -----------------------------------------------------------

def display_digit(ds, idx):
  # ds is a PyTorch Dataset
  data = ds[idx][0]  # [0] is the pixels, [1] is the label
  pixels = np.array(data)  # tensor to numpy
  pixels = pixels.reshape((8,8))
  for i in range(8):
    for j in range(8):
      pxl = pixels[i,j]  # or [i][j] syntax
      # print("%.2X" % pxl, end="")  # hexidecimal
      print("%3d" % pxl, end="")

  plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin Transformer based anomaly experiment ")

  # 1. create Dataset object
  print("\nLoading UCI digits data ")
  train_data = ".\\Data\\uci_digits_train_100.txt"
  # train_data = ".\\Data\\optdigits_train_3823.txt"
  train_ds = UCI_Digits_Dataset(train_data)
  bat_size = 2
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  # 2. create network
  print("\nCreating Transformer autoencoder ")
  net = Transformer_Net().to(device)
  net.train()  # set mode

  # 3. train 
  loss_func = T.nn.MSELoss()
  lrn_rate = 0.01
  opt = T.optim.SGD(net.parameters(), lr=lrn_rate)
  max_epochs = 20
  log_every = 4

  print("\nStarting training ")
  for epoch in range(max_epochs):
    epoch_loss = 0.0
    for bix, batch in enumerate(train_ldr):
      X = batch[0]  # 64 input pixels
      Y = batch[0].type(T.float32)   # target same as input

      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # for progress display
      loss_val.backward()            # compute gradients
      opt.step()                     # update weights

    if epoch % log_every == 0:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))

  print("Done ")

  # 4. compute and store reconstruction errors
  print("\nComputing reconstruction errors ")
  err_list = make_err_list(net, train_ds)
  err_list.sort(key=lambda x: x[1], \
    reverse=True)  # high error to low
  print("Done ")

  # 5. show most anomalous item(s)
  print("Items with largest reconstruction error: ")
  for i in range(5):
    (idx,err) = err_list[i]
    print(" [%4d]  %0.4f" % (idx, err)) 

  print("\nMost anomalous data item: ")
  idx = err_list[0][0]  # first item, index
  display_digit(train_ds, idx)

  print("\nEnd experiemnt ")

# -----------------------------------------------------------

if __name__ == "__main__":
Posted in PyTorch, Transformers | Leave a comment

Lightweight Mathematical Permutations Using C# in Visual Studio Magazine

I wrote an article titled “Lightweight Mathematical Permutations Using C#” in the July 2022 issue of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2022/07/05/lightweight-permutations-using-csharp.aspx.

A zero-based mathematical permutation of order n is a rearrangement of the integers 0 through n-1. For example, if n = 5, then two possible permutations are (0, 1, 2, 3, 4) and (3, 0, 4, 2, 1). The total number of permutations for order n is factorial(n), usually written as n! and calculated as n * (n-1) * (n-2) * . . 1. For example, 5! = 5 * 4 * 3 * 2 * 1 = 120.

The article presents a demo program. The demo illustrates how to create and display a permutation of order n, compute n! using the BigInteger type, display all permutations of order n using a Successor() function, and compute a specific permutation element directly.

Here’s the demo code for a Factorial() function:

using System.Numerics; 
static BigInteger Factorial(int n)
  if (n == 0 || n == 1)
    return BigInteger.One;

  BigInteger ans = BigInteger.Parse("1");  // alternative
  for (int i = 1; i <= n; ++i)
    ans *= i;
  return ans;

Permutations have no direct, immediate application in most software development scenarios, but once you know how permutations work, they can be used in many practical ways.

The study of permutations and probability was originally motivated by gambling, especially dice and card games. Here are four sets of dice made from natural materials. Left: malachite. Center-Left: obsidian. Center-Right: amethyst. Right: garnet.

Posted in Miscellaneous | Leave a comment

First Impressions of the Svelte JavaScript Framework

Bottom line: I am very impressed with Svelte, mostly because it just “feels right”.

Let me start by saying that the title of this blog post isn’t correct. Svelte is not a JavaScript framework like React, Angular, and Vue — Svelte is a wrapper language and a compiler.

A couple of weeks ago, I was chatting with one of my work colleagues who has the misfortune to have to create Web applications. If I am a bad person and go to hell when I die, I’m quite sure Satan will have me creating Angular applications for all eternity. Anyway, my pal mentioned that he’d been hearing a lot of positive buzz about the Svelte system, but that he hadn’t looked at Svelte yet.

The Svelte “Hello World” application in the Visual Studio code tool and development server.

Then a couple of days ago, one of the Syncfusion (www.syncfusion.com) authors I work with and whose opinions I respect, Ed F, wrote to me and told me that he’s putting the final touches on a new e-book he’s writing, titled “Svelte Succinctly”.

So, I figured I’d take Svelte out for a test run. To be honest, I was expecting to be disappointed with Svelte. “Yet another JavaScript framework,” I thought. “When will the JavaScript framework madness end?”

I found several excellent introductory tutorials on YouTube. I had a Svelte “Hello World” up and running quickly, but only because I was familiar with all the moving parts of a Svelte development system. Briefly, I installed Visual Studio Code, then in VSC I installed the Svelte development plugin, then I used nodejs (including the npm package manager) to clone the Svelte basic Hello World template from github. And presto, I had the Hello World example running in the VSC development server.

If you work with Web applications, none of this process is surprising. But if you’re new to Web development, learning how to use VSC, nodejs, github, and dozens of related tools, can take many hours of experimentation.

I dislike working with most JavaScript frameworks like Angular, Vue, React, and the dozens of others. I could list many technical reasons why I don’t like those frameworks, but instead I’ll just say that none of them feel right. I’ve been doing software development for many years, with dozens of languages and dozens of environments. With that experience, when I encounter a new system, I have an immediate reaction. Some systems feel like they’re fighting you instead of helping you. But for Svelte, my immediate impression was, “This feels right”.

Svelte code is compiled down to plain vanilla JavaScript. An application made using Svelte is just HTML, JavaScript, and CSS. No insane framework runtime like you get with frameworks.

Anyway, if you’re a Web person, you should consider checking out Svelte.

The word “svelte” means “slender and elegant”. Artist Jean-Gabriel Domergue (1889-1962) was a French painter with a very distinctive style where his subjects were always thin and graceful.

Posted in JavaScript | Leave a comment

A Quick Look at the Adversarial Robustness Toolbox (ART)

The Adversarial Robustness Toolbox (ART) is a Python library for machine learning security. ART provides tools to explore attack types of evasion, poisoning, extraction, and inference. ART supports the TensorFlow, Keras, PyTorch, MXNet, scikit-learn, XGBoost, LightGBM, CatBoost, and GPy libraries.

I decided to install ART and run one of the documentation examples. Bottom line: I was quite impressed, but the library has a steep learning curve and is very large because it supports so many machine learning libraries.

Installing ART wasn’t too difficult. I followed the instructions at the project repository at github.com/Trusted-AI/adversarial-robustness-toolbox. I first tried the command:

pip install adversarial-robustness-toolbox

But installation failed when the dependency on the llvmlite package couldn’t be deleted-updated. I then tried:

pip install adversarial-robustness-toolbox ^
  --ignore-installed llvmlite

and the ART library seemed to install correctly (but with a few of the inevitable error messages).

I went to the Examples directory on the github site and copy-pasted the PyTorch Fast Gradient Sign Method (FGSM) attack example code. Somewhat amazingly, the example worked on my first attempt.

The top row has FGSM adversarial examples generated with epsilon = 0.15. The bottom row uses epsilon = 0.2. With larger epsilon, the trained model is fooled more often, but the adversarial images are a bit easier to detect by the human eye. The caption above each image shows the true digit value and what digit the trained model was tricked into believing the image was.

FGSM takes images and then slightly modifies pixel values in a way that’s designed to make the trained model make a wrong prediction. The epsilon value controls how much the pixels are changed. The larger epsilon is, the more the images are changed and the model will make more errors, but the more the images are changed the more likely the changes can be seen by the human eye.

For the demo, the trained MNIST model scored 98.13% accuracy on the 10,000 test images. With an epsilon of 0.2, the adversarial images successfully fooled the trained model and the accuracy on those images was only 31.17% accuracy.

Fascinating stuff.

Three more or less random images from an Internet search for “gradient photography”. I like to look at photographic images but I am the worst photo-taker ever.

Posted in PyTorch | Leave a comment

The Nemes Approximation to the LogGamma Function

One morning I noticed that the Wikipedia entry on Stirling’s approximation for factorials included an approximation for the LogGamma() function. The LogGamma() function is used in many numerical applications. The Gamma() function is a generalization of the Factorial() function: Gamma(z), where z is real number, approximates Factorial(z-1) if Factorial() were defined for real numbers. Because Factorial() and Gamma() can be astronomically even large for small input values, it’s common to do all calculations using LogGamma() which doesn’t grow as quickly.

The Wikipedia entry described an approximation due to Gergo Nemes that is very simple. I coded up a quick demo using the C# language and compared the Nemes approximation for LogGamma() to the Lanczos approximation (my usual technique) and the built-in scipy loggamma() function.

As I expected, the Nemes approximation was pretty good but wasn’t quite as accurate as the Lanczos approximation. Interesting stuff.

I know nothing about fashion but it’s always seemed to me that haute couture fashion is a kind of approximation to practical fashion. Here are three examples that I think are interesting and attractive. All three have a deliberate and clever optical illusion.

Demo code. Replace “lt” with Boolean operator symbol.

using System;
namespace AnovaCSharp
  internal class AnovaProgram
    static void Main(string[] args)
      Console.WriteLine("\nBegin LogGamma() approx. demo \n");
      double[] gg = new double[] { -0.12078223763524526,
        2.4537365708424423, 7.534364236758734,
        16.292000476567242, 29.277754515040815 };
      double[] zz = new double[] { 1.5, 4.5, 7.5, 11.5, 16.5 };
      for (int i = 0; i "lt" zz.Length; ++i)
        double z = zz[i];
        double lg1 = LogGamma(z);
        double lg2 = LogGamma2(z);
        Console.WriteLine("z = " + z.ToString("F1"));
        Console.WriteLine("scipy  : " + gg[i]);
        Console.WriteLine("Lanczos: " + lg1);
        Console.WriteLine("Nemes  : " + lg2);

      Console.WriteLine("\nEnd demo ");
    } // Main()

    static double LogGamma(double z)
      // Lanczos approximation g=5, n=7
      double[] coef = new double[7] { 1.000000000190015,
        76.18009172947146, -86.50532032941677,
        24.01409824083091, -1.231739572450155,
        0.1208650973866179e-2, -0.5395239384953e-5 };

      double LogSqrtTwoPi = 0.91893853320467274178;

      // Gamma(z) = Pi / (Sin(Pi*z)) * Gamma(1-z))
      if (z "lt" 0.5) 
        return Math.Log(Math.PI / Math.Sin(Math.PI * z)) -
          LogGamma(1.0 - z);

      double zz = z - 1.0;
      double b = zz + 5.5; // g + 0.5
      double sum = coef[0];

      for (int i = 1; i "lt" coef.Length; ++i)
        sum += coef[i] / (zz + i);

      return (LogSqrtTwoPi + Math.Log(sum) - b) +
        (Math.Log(b) * (zz + 0.5));

    static double LogGamma2(double z)
      // Nemes approximation
      // https://en.wikipedia.org/wiki/Stirling%27s_approximation
      double a = 0.5 * (Math.Log(2 * Math.PI) - Math.Log(z));
      double b = (12 * z) - (1.0 / 10 * z);
      double c = Math.Log(z + (1 / b));
      double d = z * (c - 1);
      return a + d;
  } Program
} // ns
Posted in Miscellaneous | Leave a comment

An Interesting Yin-Yang Dataset for Machine Learning

I was reviewing recent research papers on neuromorphic computing (a forward-looking concept where machine learning systems more closely resemble biological systems such as spiking input chains) and I came across an interesting dataset used in the paper “Fast and Energy-Efficient Neuromorphic Deep Learning with First-Spike Times”. The authors called their dataset the Yin-Yang dataset.

Red is class 0, Blue is class 1, Green is class 2

The data is generated programmatically. Each data item has four predictor values that look like [0.28, 0.66, 0.72, 0.34] and each item is one of three classes 0, 1, 2. When graphed, the data look likes a yin-yang symbol. Somewhat oddly, the first and third predictor values are ones-complements (0.28, 0.72 in the example above) and the second and four values are also complements. Therefore, only the first two predictor values have intrinsic meaning.

I went to the github project at github.com/lkriener/yin_yang_data_set and copy-pasted the code for the YinYangDataset class. After a bit of work, I got a demo program running where I generated 1000 data items and plotted them (using just the first predictor value for the x-axis and the second value for the y-axis).

Interesting and good fun.

The yin-yang pattern is somewhat of a design cliche but here are three examples that I think are pretty nice.

Demo code.

# yin_yang_demo.py

import numpy as np
from torch.utils.data.dataset import Dataset
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

# ----------------------------------------------------------------
class YinYangDataset(Dataset):
  # code copied directly from
  # github.com/lkriener/yin_yang_data_set

# ----------------------------------------------------------------

def main():
  print("\nBegin YinYangDataset demo ")
  print("\nGenerating 1000 data items ")

  ds = YinYangDataset(size=1000, seed=1)
  loader = DataLoader(ds, batch_size=1, shuffle=False)

  xs = []  # list to pass to plot
  ys = []
  cs = []

  for bix, batch in enumerate(loader):
    # bix is vatch index, 0 to 999
    X = batch[0]  # like [[0.6800, 0.4500, 0.3200, 0.5500]]
    c = batch[1]  # 0, 1, 2

  xs = np.array(xs)  # convert for plot
  ys = np.array(ys)
  cs = np.array(cs)
  plt.scatter(xs[cs==0], ys[cs==0], color='red',
  plt.scatter(xs[cs==1], ys[cs==1], color='blue',
  plt.scatter(xs[cs==2], ys[cs==2], color='green',

  print("\nEnd ")

if __name__ == "__main__":
Posted in Miscellaneous | Leave a comment

PyTorch Multi-Class Accuracy by Class

I was presenting an all-day PyTorch workshop and one of my examples was multi-class classification. The goal was to predict the job-type of an employee (0 = mgmt, 1 = supp, 2 = tech) from sex, age, city (anaheim, boulder, concord), income.

My demo had a program-defined accuracy() function to compute the overall classification of the trained model. The result was 81.50% accuracy on the training data.

One of the attendees in the workshop, Alex, pointed out that in a non-demo scenario, you should compute accuracy for each of the three job types. This lets you see situations such as when your model does well on most classes but fails badly when predicting on one of the classes.

I coded up a demo. I defined an accuracy() function as:

def accuracy(model, ds, num_classes):
  # assumes model.eval()
  # by class, item-by-item version
  counts = np.zeros((num_classes,2), dtype=np.int64)  
  for i in range(len(ds)):
    X = ds[i][0].reshape(1,-1)  # make it a batch
    Y = ds[i][1]  # 0 1 or 2
    with T.no_grad():
      oupt = model(X)  # logits form

    big_idx = T.argmax(oupt)  # 0 or 1 or 2
    if big_idx == Y.item():
      counts[Y.item()][0] += 1  # correct
      counts[Y.item()][1] += 1  # wrong

  pcts = np.zeros(num_classes, dtype=np.float32)
  for c in range(num_classes):
    pcts[c] = counts[c][0] * 1.0 / (counts[c][0] + \

  num_correct = 0; num_wrong = 0
  for c in range(num_classes):
    num_correct += counts[c][0]
    num_wrong += counts[c][1]
  overall = num_correct * 1.0 / (num_correct + \

  return counts, pcts, overall

The demo code iterates through data one item at a time. This approach is slow but allows you to diagnose problems. A much faster approach is to send all inputs to the model, then fetch all outputs, and then analyze for correct / wrong. The code would look something like:

X = dataset[0:len(ds)][0]        # all inputs
Y = T.flatten(ds[0:len(ds)][1])  # all targets

with T.no_grad():
  oupt = model(X)
arg_maxs = T.argmax(oupt, dim=1) # all predicteds
. . . 

My program-defined accuracy() function returns three values for the demo data:

[[46 15]
 [77 11]
 [40 11]]

[0.7541  0.8750  0.7843]


The first value is the counts of correct and wrong predictions. So, for class 0 = mgmt, there were 46 correct predictions and 15 wrong predictions. For class 1 = supp, there were 77 correct and 11 wrong. And so on.

The second value is the percent accuracy by class. So, for class 0 the accuracy was 75.41%, and 87.50% accuracy for class 1, and 78.43% accuracy for class 2. The fact that the accuracy metrics were similar for all classes is good.

The third return value is the overall result of the model across all classes, which is 81.5% accuracy.

Good fun.

Posted in PyTorch | Leave a comment

Recap of the 2022 Predictive Analytics World Conference

I presented a talk titled “Neural Exotica: Beyond Basic Architectures” at the 2022 Predictive Analytics World conference. The event ran June 19-24 and was in Las Vegas.

The Predictive Analytics World (PAW) conference was co-located with the Machine Learning Week (MLW) conference. The events share the same conference space but have different areas of emphasis. PAW focuses on industry applications of all kinds of predictive systems, with an emphasis on practical systems. MLW focuses on more technical topics, mostly related to neural technologies and advanced classical ML techniques.

Left: My room before attendees arrived. Right: The conference had a small Expo.

In addition to my presentation at PAW, earlier in the week of the events, I presented a full-day workshop as part of the MLW events.

My talk had about 60 people in the audience. They asked excellent questions.

Two of the slides from my talk. Left: This slide indicates how difficult GANs are (and therefore the appeal of the new diffusion architecture). Right: A recap.

In my PAW talk, I described five neural architectures: autoencoders, variational autoencoders (VAEs), generative adversarial networks (GANs), Siamese networks, and transformer architecture networks. I also mentioned the new diffusion architecture networks, which may replace GANs for high resolution image generation.

I enjoyed both the PAW and MLW events a lot. The venue and logistics were excellent, and I gained a lot of valuable information from conversations with attendees.

Posted in Conferences | Leave a comment