PyTorch v1.4 Neural Network for the Iris Dataset

PyTorch is a neural network library that can use either CPU or GPU processors. As I write this, the latest version of PyTorch is v1.4 which was released in January 2020. I figured I’d take v1.4 out for a test drive to see if my old v1.2 code still works. Result: yes, my old code still works with the 1.4 version of PyTorch.

I immediately ran into a minor problem when trying to install v1.4 of PyTorch. I use pip (rather than conda) as my Python package manager. I prefer to install my Python packages manually from their .whl files. The Web page used to give links to individual .whl files, but the current page instead gives a pip install command that fetches the latest version, which I didn’t want to use.

After a bit of searching, I located the individual .whl files at

Most of my dev machines run Windows and have a CPU (with no GPU) or an older GPU that doesn’t support the latest builds of GPU PyTorch. I am currently using Python version 3.6.5 (via Anaconda version 5.2.0). So, I downloaded this file: torch-1.4.0+cpu-cp36-cp36m-win_amd64.whl to my local machine. As I write this, I’m reminded that version compatibility in the Python world is still a huge issue, even for experienced people, but especially for people new to Python and PyTorch.

I uninstalled PyTorch v1.2 using the shell command “pip uninstall torch”. Then I installed v1.4 using the command “pip install (the-whl-file)”. I got an error message of “distributed 1.21.8 requires msgpack, which is not installed” which I ignored. I assume this has something to do with Anaconda.

In all my previous PyTorch program investigations, I simply ignored the “device” issue. My programs just magically worked. I decided I’d explicitly specify the device for each Tensor and Module object. This is a big topic but briefly, when you create a Tensor object, the fundamental data type of PyTorch, you can specify whether it should be processed by a CPU or a GPU. For example:

import torch as T
device = T.device("cpu")
. . . 
X = T.Tensor(data_x[i].reshape((1,n))).to(device)

So I went through my old Iris example script and added explicit to(device) directives. Unfortunately, there were a lot of statements to modify and even if I missed some, my script would still work. The only way to know would be to change the device to GPU and run the script on a machine with a GPU (which I don’t have right now).

Anyway, the moral of the story is that working with PyTorch is very difficult. PyTorch knowledge isn’t something like knowledge of batch files where you can pick it up easily as needed. Working with PyTorch is essentially a full-time job.

Here’s my (possibly buggy on a GPU) Iris program. I’ve substituted “less-than” for the less than operator so my blog software doesn’t go insane.

# PyTorch 1.4.0 Anaconda3 5.2.0 (Python 3.6.5)
# CPU, Windows, no dropout

import numpy as np
import torch as T
device = T.device("cpu")  # apply to Tensor or Module

# -----------------------------------------------------------

class Batcher:
  def __init__(self, num_items, batch_size, seed=0):
    self.indices = np.arange(num_items)
    self.num_items = num_items
    self.batch_size = batch_size
    self.rnd = np.random.RandomState(seed)
    self.ptr = 0

  def __iter__(self):
    return self

  def __next__(self):
    if self.ptr + self.batch_size "greater-than" self.num_items:
      self.ptr = 0
      raise StopIteration  # ugly
    else:
      result = self.indices[self.ptr:self.ptr+self.batch_size]
      self.ptr += self.batch_size
      return result

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 7)  # 4-7-3
    self.oupt = T.nn.Linear(7, 3)


  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = self.oupt(z)  # no softmax. see CrossEntropyLoss() 
    return z

# -----------------------------------------------------------

def accuracy(model, data_x, data_y):
  # data_x and data_y are numpy nd arrays
  N = len(data_x)    # number data items
  n = len(data_x[0])  # number features

  n_correct = 0; n_wrong = 0
  for i in range(N):
    X = T.Tensor(data_x[i].reshape((1,n))).to(device)
    Y = T.LongTensor(data_y[i].reshape((1,1))).to(device)
    oupt = model(X)
    (big_val, big_idx) = T.max(oupt, dim=1)
    if big_idx.item() == data_y[i]:
      n_correct += 1
    else:
      n_wrong += 1
  return (n_correct * 100.0) / (n_correct + n_wrong)

def main():
  # 0. get started
  print("\nBegin Iris Dataset using PyTorch demo \n")
  # 1. load data
  print("Loading Iris data into memory \n")
  train_file = ".\\Data\\iris_train.txt"
  test_file = ".\\Data\\iris_test.txt"

  # data looks like:
  # 5.1, 3.5, 1.4, 0.2, 0
  # 6.0, 3.0, 4.8, 1.8, 2
  train_x = np.loadtxt(train_file, usecols=range(0,4),
    delimiter=",",  skiprows=0, dtype=np.float32)
  train_y = np.loadtxt(train_file, usecols=[4],
    delimiter=",", skiprows=0, dtype=np.float32)

  test_x = np.loadtxt(test_file, usecols=range(0,4),
    delimiter=",",  skiprows=0, dtype=np.float32)
  test_y = np.loadtxt(test_file, usecols=[4],
    delimiter=",", skiprows=0, dtype=np.float32)

  # 2. create network
  net = Net().to(device)

  # 3. train model
  lrn_rate = 0.05
  loss_func = T.nn.CrossEntropyLoss()  # applies softmax()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
  max_epochs = 100
  N = len(train_x)
  bat_size = 16
  batcher = Batcher(N, bat_size)

  print("Starting training")
  for epoch in range(0, max_epochs):
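    for curr_bat in batcher:
      X = T.Tensor(train_x[curr_bat]).to(device)
      Y = T.LongTensor(train_y[curr_bat]).to(device)
      optimizer.zero_grad()  # reset gradients from previous batch
      oupt = net(X)
      loss_obj = loss_func(oupt, Y)
      loss_obj.backward()  # compute gradients
      optimizer.step()  # update weights and biases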

    if epoch % (max_epochs/10) == 0:
      print("epoch = %6d" % epoch, end="")
      print("  prev batch loss = %7.4f" % loss_obj.item(), end="")
      acc = accuracy(net, train_x, train_y)
      print("  accuracy = %0.2f%%" % acc) 
  print("Training complete \n")

  # 4. evaluate model
  # net = net.eval()
  acc = accuracy(net, test_x, test_y)
  print("Accuracy on test data = %0.2f%%" % acc) 

  # 5. save model
  print("Saving trained model \n")
  path = ".\\Models\\iris_model.pth"
  T.save(net.state_dict(), path)

  # 6. make a prediction 
  unk_np = np.array([[6.1, 3.1, 5.1, 1.1]], dtype=np.float32)
  unk_pt = T.tensor(unk_np, dtype=T.float32).to(device) 
  logits = net(unk_pt).to(device)  # do not sum to 1.0
  probs_pt = T.softmax(logits, dim=1).to(device)
  probs_np = probs_pt.detach().numpy()

  print("Predicting species for [6.1, 3.1, 5.1, 1.1]: ")
  print(probs_np)  # pseudo-probabilities for the three species

  print("\n\nEnd Iris demo")

if __name__ == "__main__":
  main()

Images from an Internet search for “python clothes”. Left: A python pattern dress and shoes. Center: A man’s python pattern jacket. Right: A brightly-colored python pattern dress. I find these designs oddly attractive.

Posted in Machine Learning, PyTorch

Machine Learning and Teenage Murderers

Several weeks ago there was a widely reported news story where a young New York college student, Tessa Majors, was murdered by three teenage boys. It was only a few days later when I realized that, to some extent, I had become numbed to such news, as opposed to being shocked.

Left: College student Tessa Majors was murdered by three teenage boys. Right: One of the boys who confessed to the murder.

Whenever I see or think about some phenomenon, I wonder if machine learning can be applied in some way. In the case of teenage murderers, I’m stumped.

I’m generally not too interested in sociology, but from what little I’ve read, it seems as if most teenage murderers fit the same template: male, poorly educated, low intelligence, raised by a single mother who is often dependent on public assistance, and so on.

But this is correlation, not causation. Knowing a particular template provides information about which types of teenagers are more likely to commit murder, but that information doesn’t explain why such teens commit murders or suggest ways to prevent such teens from committing murders.

The bottom line is that I have no suggestions about how machine learning could be used to reduce the number of murders committed by teenagers. But I hope I’ll never become so numbed to such problems that I stop wondering about such things and thinking about how machine learning can be used for good purposes.

Images from a Google search for “teens arrested murder”. There were hundreds of results like these. Sad.

Posted in Machine Learning

How to Create a Radial Basis Function Network Using C#

I wrote an article titled, “How to Create a Radial Basis Function Network Using C#” in the March 2020 edition of Visual Studio Magazine. See

A radial basis function (RBF) network is a software system that is similar to a single hidden layer neural network. In my article I explain how to design an RBF network and describe how an RBF network computes its output. I use the C# language but it shouldn’t be difficult to refactor the demo code to another programming language.

I explained RBF networks using a demo program. The demo sets up a 3-4-2 RBF network. There are three input nodes, four hidden nodes, and two output nodes. You can imagine that the RBF network corresponds to a problem where the goal is to predict if a person is male or female based on their age, annual income, and years of education.

The demo program set dummy values for the RBF network’s centroids, widths, weights, and biases. The demo set up a normalized input vector of (1.0, -2.0, 3.0) and sent it to the RBF network. The final computed output values are (0.0079, 0.9921). If the output nodes correspond to (0, 1) = male and (1, 0) = female, then you’d conclude that the person is male.

Each hidden node also has a single width value. The width values are sometimes called standard deviations, and are often given the symbol Greek lower case sigma or lower case English s. In the diagram, s0 is 2.22, s1 is 3.33 and so on.

Each hidden node has a value that is determined by the input node values, the hidden node’s centroid values, and the node’s width value. In the diagram, the value of hidden node [0] is 0.0014, the value of hidden node [1] is 0.2921 and so on.

It is common to place a bell-shaped curve icon next to each hidden node in an RBF network diagram to indicate that the nodes are computed using a radial basis function with centroids and widths rather than using input-to-hidden weights as is done in single hidden layer neural networks.
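
Here is a minimal Python sketch of how one hidden node value could be computed. It assumes the common Gaussian form of a radial basis function, exp(-d^2 / (2 * s^2)), where d is the Euclidean distance between the input and the centroid. The centroid values below are hypothetical placeholders (they are not the values used in the article demo), so the printed result will not match the diagram.

```python
import math

def rbf_hidden_value(x, centroid, width):
    # squared Euclidean distance between the input and the centroid
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, centroid))
    # Gaussian radial basis function (assumed form)
    return math.exp(-dist_sq / (2.0 * width ** 2))

x = [1.0, -2.0, 3.0]         # normalized input vector from the demo
centroid = [0.5, -1.0, 2.0]  # hypothetical centroid, not from the article
width = 2.22                 # s0 from the diagram

h = rbf_hidden_value(x, centroid, width)
print("hidden node value = %0.4f" % h)
```

The value is 1.0 when the input sits exactly on the centroid and falls toward 0.0 as the input moves away, which is the bell-shaped behavior the icon suggests.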

There is a weight value associated with each hidden-to-output connection. The demo 3-4-2 RBF network has 4 * 2 = 8 weights. In the diagram, w00 is the weight from hidden [0] to output [0] and has value 5.0. Weight w01 is from hidden [0] to output [1] and has value -5.1 and so on.

There is a bias value associated with each output node. The bias associated with output [0] is 7.0 and the bias associated with output [1] is 7.1.

The two output node values of the demo RBF network are (0.0079, 0.9921). Notice the final output node values sum to 1.0 so that they can be interpreted as probabilities. Internally, the RBF network computes preliminary output values of (4.6535, 9.4926). These preliminary output values are then scaled so that they sum to 1.0 using the softmax function.
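
The softmax scaling step can be checked with a few lines of Python. This is just a sketch using the preliminary output values quoted above; the function name softmax is my own and the max-subtraction trick is a standard way to avoid arithmetic overflow, not something taken from the article.

```python
import math

def softmax(values):
    # subtract the largest value first to avoid overflow in exp()
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

prelim = [4.6535, 9.4926]  # preliminary output values from the demo
probs = softmax(prelim)
print(["%0.4f" % p for p in probs])  # ['0.0079', '0.9921']
```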

Three creepy prehistoric animals that have (mostly) radial symmetry. Left: Sollasina cthulhu, a sea cucumber that lived 430 million years ago. Center: Wiwaxia, a marine slug-like creature that lived in the Cambrian Period. Right: Maotianoascus and Ctenrhabdotus, ancient predecessors to jellyfish. Ugh. These are the kind of creatures that give me nightmares.

Posted in Machine Learning

Researchers Release Open Source Counterfactual Machine Learning Library

I contributed to an article titled “Researchers Release Open Source Counterfactual Machine Learning Library” in the March 2020 edition of the PureAI Web site. See

Counterfactuals are best explained by example. Suppose a loan company has a trained ML model that is used to approve or decline customers’ loan applications. The predictor variables (often called features in ML terminology) are things like annual income, debt, sex, savings, and so on. A customer submits a loan application. Their income is $45,000 with debt = $11,000 and their age is 29 and their savings is $6,000. The application is declined.

A counterfactual is a change to one or more predictor values that results in the opposite result. For example, one possible counterfactual could be stated in words as, “If your income was increased to $60,000 then your application would have been approved.”

In general, there will be many possible counterfactuals for a given ML model and set of inputs. Two other counterfactuals might be, “If your income was increased to $50,000 and debt was decreased to $9,000 then your application would have been approved” and, “If your income was increased to $48,000 and your age was changed to 36 then your application would have been approved.” The image below illustrates three such counterfactuals for a loan scenario.
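
To make the idea concrete, here is a toy Python sketch that searches for a single-variable counterfactual. Everything in it is hypothetical: the approve() function is a hand-written rule standing in for a trained model, the threshold was chosen only so the numbers line up with the loan example, and the brute-force search has nothing to do with how a real counterfactual library works.

```python
def approve(income, debt, age, savings):
    # Hypothetical stand-in for a trained model: a hand-written rule,
    # not a real lending model. The age input is ignored by this rule.
    score = income / 1000.0 - 2.0 * debt / 1000.0 + 0.5 * savings / 1000.0
    return score >= 41.0

def income_counterfactual(person, step=1000, limit=200000):
    # Raise income in fixed steps until the model's decision flips.
    trial = dict(person)
    while trial["income"] <= limit:
        if approve(**trial):
            return trial
        trial["income"] += step
    return None  # no counterfactual found within the limit

applicant = {"income": 45000, "debt": 11000, "age": 29, "savings": 6000}

print(approve(**applicant))  # False (declined)
cf = income_counterfactual(applicant)
print(cf["income"])  # 60000
```

Searching over one variable at a time is the simplest possible approach. Finding several different counterfactuals that each change only a few variables is the harder problem.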

Some Microsoft counterfactuals research is detailed in a paper titled “Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations” by Ramaravind K. Mothilal (Microsoft), Amit Sharma (Microsoft), and Chenhao Tan (University of Colorado). The project generated an open source code library called Diverse Counterfactual Explanations (DiCE) which is available at:

The library is implemented in Python and currently supports Keras / TensorFlow models, and support for PyTorch models is being added. The researchers applied the DiCE library to the well-known benchmark Adult Data Set where the goal is to predict if a person makes less than $50,000 or more than $50,000 annually based on predictor variables such as education level, occupation type, and race.

A partial code snippet that illustrates what using the DiCE library looks like is:

import dice_ml
d = dice_ml.Data(. .)  # load dataset
m = dice_ml.Model(. .)  # load trained model
ex = dice_ml.Dice(d, m)  # create DiCE "explanation"
q = {'age': 22, 'race': 'White', . .}  # model input
# now generate 4 counterfactuals
cfs = ex.generate_counterfactuals(q, 4, . .)

The model prediction using the original input values is that the person’s income is less than $50,000. Here are the four resulting counterfactuals:

The four counterfactuals all generate a prediction that a similar person has income of greater than $50,000. For example, the first counterfactual changes the values of four predictor variables: education changes from HS-grad to Masters; age changes from 22 to 65; marital status changes from Single to Married; and sex changes from Female to Male.

Posted in Machine Learning

The Determinant of a Matrix Using Recursion and C#

Working with matrices is a common task in machine learning. Most of my colleagues, and me too, have a personal library of matrix routines. I was dusting off my personal matrix library recently and, just for fun, decided to implement a Determinant() function using recursion.

I am not a fan of recursion and I rarely use it except when working with tree data structures, and even then I avoid recursion when possible. So my recursive implementation of Determinant() was mostly for mental exercise.

If M =

 3  6
 2  7

then Det(M) = (3 * 7) - (6 * 2) = 9. If M =

 1  2  3
 4  5  6
 7  8  9

then

Det(M) =  (+1) * 1 *  det 5  6
                          8  9

        + (-1) * 2 *  det 4  6
                          7  9

        + (+1) * 3 *  det 4  5
                          7  8

and so on.

Anyway, after thrashing around for a few minutes, I came up with the following C# implementation of a recursive function to compute the determinant of a matrix:

static double Det(double[][] m)
{
  double sum = 0.0;
  int sign;  // -1 or +1

  if (m.Length == 1)
    return m[0][0];
  else if (m.Length == 2)
    return (m[0][0] * m[1][1]) - (m[0][1] * m[1][0]);

  for (int j = 0; j less-than m.Length; ++j) // each col of m
  {
    double[][] small = new double[m.Length-1][];  // n-1 x n-1
    for (int i = 0; i less-than small.Length; ++i)
      small[i] = new double[m.Length-1];

    for (int r = 1; r less-than m.Length; ++r)  // start row [1]
    {
      for (int c = 0; c less-than m.Length; ++c)
      {
        if (c less-than j)
          small[r - 1][c] = m[r][c];
        else if (c greater-than j)
          small[r - 1][c - 1] = m[r][c];
        else // if (c == j)
          ; // skip this col
      } // c
    } // r

    if (j % 2 == 0)
      sign = +1;
    else
      sign = -1;

    sum += sign * m[0][j] * Det(small); // recursive call
  } // j
  return sum;
}
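
For comparison, here is a rough Python sketch of the same expand-along-the-first-row idea. It is not a translation of my library code. The function name det and the list-of-lists matrix representation are just choices made for this illustration.

```python
def det(m):
    # m is a square matrix stored as a list of row lists
    n = len(m)
    if n == 1:
        return m[0][0]
    if n == 2:
        return m[0][0] * m[1][1] - m[0][1] * m[1][0]

    total = 0.0
    for j in range(n):  # each column of the first row
        # sub-matrix: drop row [0] and drop column [j]
        small = [row[:j] + row[j+1:] for row in m[1:]]
        sign = 1 if j % 2 == 0 else -1
        total += sign * m[0][j] * det(small)  # recursive call
    return total

print(det([[3, 6], [2, 7]]))                   # 9
print(det([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # 0.0
```

The recursive approach does a factorial amount of work as the matrix grows, so it is only practical for small matrices.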

My personal matrix library has a non-recursive Determinant() function that uses Crout’s decomposition technique.

Moral of the story: working with matrices is rather tricky but it’s an essential skill for machine learning.

Three determined puppies. Left: This is “Llama” and she is determined to get her owner’s attention. Center: This is my dog “Riley”. I was taking a nap and when I woke up, Riley was determined to get praise for chewing up my “Chess Life” magazine and three random socks. Right: This determined puppy walks softly and carries a big stick.

Posted in Machine Learning

The Difference Between the Norm of a Vector and the Distance Between Two Vectors

Bottom line: It is possible to express the distance between two vectors as the norm of their difference.

Suppose you have two vectors:

v1 = (2.0, 5.0, 3.0)
v2 = (1.0, 7.0, 0.0)

The difference of two vectors is just a vector made from the difference of their components:

v1 - v2 = (2-1, 5-7, 3-0)
        = (1.0, -2.0, 3.0)

The norm of a vector is the square root of the sum of the squared components:

|| v1 || = sqrt(2^2 + 5^2 + 3^2)
         = sqrt(4 + 25 + 9)
         = sqrt(38)
         = 6.16

|| v2 || = sqrt(1^2 + 7^2 + 0^2)
         = sqrt(1 + 49 + 0)
         = sqrt(50)
         = 7.07

The Euclidean distance between two vectors is the square root of the sum of the squared differences between components:

dist(v1, v2) = sqrt( (2-1)^2 + (5-7)^2 + (3-0)^2 )
             = sqrt( 1 + 4 + 9 )
             = sqrt(14)
             = 3.74

It is possible, and common, to express Euclidean distance between two vectors as the norm of their difference:

|| v1 - v2 || = || (2, 5, 3) - (1, 7, 0) ||
              = || (1, -2, 3) ||
              = sqrt( 1^2 + (-2)^2 + 3^2 )
              = sqrt( 1 + 4 + 9 )
              = sqrt(14)
              = 3.74

In other words

dist(v1, v2) = || v1 - v2 ||
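
The arithmetic above can be verified with a short Python sketch. The helper names norm() and distance() are my own, chosen for this illustration.

```python
import math

def norm(v):
    # square root of the sum of the squared components
    return math.sqrt(sum(x * x for x in v))

def distance(a, b):
    # square root of the sum of the squared component differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v1 = [2.0, 5.0, 3.0]
v2 = [1.0, 7.0, 0.0]
diff = [x - y for x, y in zip(v1, v2)]  # [1.0, -2.0, 3.0]

print("%0.2f" % norm(v1))          # 6.16
print("%0.2f" % norm(v2))          # 7.07
print("%0.2f" % distance(v1, v2))  # 3.74
print("%0.2f" % norm(diff))        # 3.74
```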

The relationship between the norm of a vector and the Euclidean distance between two vectors appears in several machine learning scenarios. I was talking to a colleague recently. He wants to create a roadmap for software developers who want to gain machine learning knowledge and skills. This leads to the question of exactly what, if any, math background is necessary.

Knowing the roughly 100 basic math techniques for ML like the one described here is useful, but is it necessary? On the one hand, norm vs. distance is not a difficult idea and anyone can learn it on the fly. But on the other hand, if you need to pick some math knowledge up while you’re in the middle of an ML topic that uses the knowledge, it makes learning ML much more difficult.

Three paintings by artist Stanislaw Krupp. Sort of a modern Art Nouveau style. I don’t think an artist can pick up new art techniques on the fly while he’s in the middle of creating a painting, but I’m not an artist so I could be wrong.

Posted in Machine Learning

The Difference Between a Code Library and a Framework

What is the difference between a code library and a code framework? Answer: The question really doesn’t make any sense because a “library” and a “framework” can mean whatever you want. There are no formal definitions of the two terms.

The terms library and framework just mean code modules that have been pre-written by you or someone else. However, it’s sometimes useful to think about how “library-ish” or how “framework-ish” some code is.

Most of my colleagues, and me too, generally think about library-ish code as being low level modules that are mostly independent of each other, where you usually edit/modify the code, and connect different library-ish modules with custom code.

We usually think about framework-ish code as being high level modules that are highly dependent on each other, where you rarely modify the code, and often use only the framework-ish code with little or no custom connecting code.

Functions in framework-ish code often have a large number of parameters because framework-ish functions aren’t easy to modify. Functions in library-ish code usually don’t have a lot of parameters — just the essentials, because you can add additional parameters and modify the code in library-ish functions relatively easily.

Here’s an example of what I’d call C# library-ish code, in a machine learning context:

double[][] trainX =
  MatLoad(".\\testData.txt", new int[] { 0, 2, 4 }, '\t');

. . .

static double[][] MatLoad(string fn, int[] cols, char sep)
  // custom code to load data into an array-of-arrays matrix

And here’s an example of what I’d call framework-ish code from ML.NET that does roughly the same thing:

using Microsoft.ML;
using Microsoft.ML.Data;
using EmpClassifier.Model.DataModels;
. . .
IDataView trainDataView =
  mlContext.Data.LoadFromTextFile<ModelInput>(  // mlContext name assumed
    . . .
    hasHeader: true,
    separatorChar: '\t',
    allowQuoting: true,
    allowSparse: false);
. . .
namespace EmpClassifier.Model.DataModels
{
  public class ModelInput
  {
    [ColumnName("hourly"), LoadColumn(0)]
    public bool Hourly { get; set; }
  . . .

Library-ish code uses mostly primitive data types such as int[] and double[][] while framework-ish code usually has many custom class and interface definitions. The ML.NET LoadFromTextFile() code has a hasHeader parameter. If you wanted to add such a parameter to the library-ish MatLoad() function you could do so.
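
To show how little work it is to extend library-ish code, here is a Python sketch of a loader in the spirit of MatLoad() with a has_header parameter added. The function name mat_load, its parameter names, and the temporary demo file are all made up for this illustration. This is not my C# library code.

```python
import os
import tempfile

def mat_load(fn, cols, sep, has_header=False):
    # load the selected columns of a delimited text file
    # into a list-of-lists matrix of float values
    result = []
    with open(fn, "r") as f:
        for i, line in enumerate(f):
            if has_header and i == 0:
                continue  # skip the header line
            line = line.strip()
            if line == "":
                continue  # skip blank lines
            tokens = line.split(sep)
            result.append([float(tokens[c]) for c in cols])
    return result

# demo: write a tiny tab-delimited file, then load columns [0] and [2]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\tb\tc\n1.0\t2.0\t3.0\n4.0\t5.0\t6.0\n")
    path = tmp.name

train_x = mat_load(path, [0, 2], "\t", has_header=True)
os.remove(path)
print(train_x)  # [[1.0, 3.0], [4.0, 6.0]]
```

Adding the parameter took one extra argument and two lines in the loop, which is the kind of modification that is usually impractical with framework-ish code.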

Side effects of the complexity and high level of abstraction of the framework-ish approach are that framework-ish modules are often difficult or impossible to modify, and therefore framework-ish modules often force you into architecting a system in one particular way.

With library-ish code, you must have a greater knowledge of algorithms and greater coding skills.

The distinction between library-ish code and framework-ish code isn’t Boolean. Most pre-written code has various degrees of the factors I’ve described.

So, when I read or hear the question, “What’s the difference between a library and a framework?” I’m pretty sure the person who asked the question is an inexperienced developer.

The purpose of rigidly defining terms in computer science or in any field is to clarify communications. Slapping labels on concepts does not increase knowledge, and people who do so and proclaim themselves as experts are almost always egoists trying to make money. A good example of this is the computer science so-called SOLID principles — absolute meaningless nonsense.

In business, most of Six Sigma is nothing more than hilariously obscure terminology and acronyms like DMAIC which are designed to make people believe that Six Sigma is something more than just a few useful concepts that can be explained and learned in two hours. And “Agile” programming is just a few common-sense ideas but is typically surrounded by massive amounts of lame terminology intended to enable bogus training. (Wow! When did I become the cranky old programmer guy?)

Posted in Machine Learning, Miscellaneous