The Worst Logistic Regression Graph Diagram on the Internet

Argh! I have to post on this topic.

Strewn throughout the Internet is a graph that is supposed to explain what logistic regression is and how it works. I’ve seen this graph, and variations of it, for years and it has been blindly copied dozens of times. And it is so completely wrong.

Here are two common versions of the horrible graph I’m talking about:

Examples of graphs that are supposed to explain logistic regression. They are completely wrong.

The graphs are worse than meaningless. They’re actively misleading.

I created an example and two diagrams that correctly illustrate what logistic regression is. I set up 10 dummy items where the goal is to predict if a person is male (class 0) or female (class 1) based on just two predictor variables, x0 = Age and x1 = Income. I plotted the data on the top graph. This was possible only because there are just two predictor variables — if there were three or more I couldn’t have made a 2D graph even though logistic regression works for any number of predictor variables. There are two colors for the dots because logistic regression is a binary classifier technique.

The top graph is training data for a logistic regression problem. The bottom graph is logistic regression for the data.

Logistic regression is designed to handle data that is mostly (or completely) linearly separable, as is the case for the dummy data.

NOTE: When data is completely linearly separable, as here, there are two huge problems. First, there are an infinite number of solution weights and biases. Second, if you use some form of simple stochastic gradient descent, the weights and biases can grow towards plus or minus infinity. These two problems are very complex in theory. In practice there are easy ways to deal with data that is completely linearly separable.

The bottom graph illustrates how logistic regression works. First you find a weight for each variable and a bias value. I used one of dozens of training techniques and got w0, the weight for age x0, equal to 13.5. I got w1, the weight for income x1, equal to -12.2. I got a bias value of 1.12.

For each data item, you compute a predicted class in two steps. First, z = (w0 * x0) + (w1 * x1) + b. Then, p = 1 / (1 + exp(-z)). If p is less than 0.5 the predicted class is 0 (male); otherwise the predicted class is 1 (female).
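As a concrete sketch, here is the two-step computation in plain Python, using the weight and bias values from the demo above. The (age, income) input values are hypothetical normalized values, just for illustration:

```python
import math

def predict(x0, x1, w0=13.5, w1=-12.2, b=1.12):
  # step 1: linear combination of inputs, weights, bias
  z = (w0 * x0) + (w1 * x1) + b
  # step 2: logistic sigmoid squashes z into (0, 1)
  p = 1.0 / (1.0 + math.exp(-z))
  return 0 if p < 0.5 else 1  # 0 = male, 1 = female

print(predict(0.5, 0.9))  # z is negative, so class 0
print(predict(0.2, 0.1))  # z is positive, so class 1
```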

The equation for p is called the logistic sigmoid function. It is an “S”-shaped curve where z on the horizontal axis runs from minus infinity to plus infinity, and p on the vertical axis is always between 0 and 1. The logistic sigmoid function always looks exactly the same. The predicted p values for each data item will always lie exactly on the line of the graph of the function, as shown. Dots below 0.5 (the red dashed line) are class 0, dots above 0.5 are class 1.

So, the horrible graph you will see plastered everywhere on the Internet is an incorrect combination of plotting the data items together with the logistic regression function.

People who put the bad graph, or a version of it, on their blog sites clearly do not fully understand logistic regression.

Machine learning is not simple. Anyone with reasonably good math skill can learn ML but it requires a lot of study.

Shown below are three of the most famous graphs in history.

Top: Anscombe’s Quartet (1973) shows four datasets. All four datasets have identical linear regression coefficients, x and y means, x and y variance, and Pearson Correlation Coefficients. The point is that sometimes statistics by themselves aren’t enough to describe a dataset.

Center: Murray’s Bell Curve (1994) shows the IQ of two different groups. The point is that the difference in intelligence between groups is surprisingly large (about a full standard deviation) and there are many interpretations of what this intelligence gap means.

Bottom: Snow’s London Cholera map (1854) shows the sources of a cholera outbreak in London. It revealed that many deaths occurred near a water pump on Broad Street, which suggested that cholera might be spread by contaminated water (as was later confirmed).

Posted in Machine Learning | 2 Comments

Neural Network Lottery Ticket Hypothesis: The Engineer In Me Is Not Impressed

The neural network lottery ticket hypothesis was proposed in a 2019 research paper titled “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks” by J. Frankle and M. Carbin. Their summary of the idea is:

We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the “lottery ticket hypothesis:” dense, randomly-initialized, feed-forward networks contain subnetworks (“winning tickets”) that – when trained in isolation – reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.

Let me summarize the idea in the way that I think about it:

Huge neural networks with many weights are extremely time consuming to train. It turns out that it’s possible to train a huge network, then prune away weights that don’t contribute much, and still get a model that predicts well.

The lottery ticket idea has limited usefulness because you start by training a gigantic neural network. Then you prune away some weights. This helps a bit at inference time when the trained model is used to make predictions, but running input through a trained model doesn’t usually take much time so not much is gained. The idea is useful from a theoretical point of view — knowing that huge neural networks can in fact be compressed without sacrificing very much prediction accuracy means that maybe it’s possible to find a good compressed neural network before training rather than after training.

First two pages from the lottery ticket hypothesis paper.

I’ve seen three research ideas for compressing a neural network before training. The first paper is “SNIP: Single-Shot Network Pruning Based on Connection Sensitivity” (2019) by N. Lee, T. Ajanthan, and P. Torr. The idea is to run training data through the network once, find the weights whose associated gradients have small magnitudes, and then delete those weights.
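A toy NumPy sketch of the connection-sensitivity idea (a simplified illustration, not the paper's exact algorithm): score each weight by the magnitude of gradient times weight from one pass of data, then zero out the lowest-scoring half:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(10)   # hypothetical weights
g = rng.standard_normal(10)   # gradients from one data pass
scores = np.abs(g * w)        # connection sensitivity score
keep = scores >= np.quantile(scores, 0.5)  # keep the top half
w_pruned = np.where(keep, w, 0.0)          # delete the rest
print(int((w_pruned == 0.0).sum()))  # 5 weights pruned away
```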

The second paper is “Picking Winning Tickets Before Training by Preserving Gradient Flow” (2020) by C. Wang, G. Zhang, and R. Grosse. Their idea is basically a refinement of the SNIP paper. The idea is to use second derivatives to estimate the effect that dropping a weight will have on gradient flow after pruning, rather than scoring weights before pruning as in the SNIP technique.

The third paper is “Initialization and Regularization of Factorized Neural Layers” (2021) by M. Khodak, N. Tenenholtz, L. Mackey, and N. Fusi. The idea is to factor each (large) weight matrix into two (smaller) weight matrices using singular value decomposition. The two smaller matrices of weights can be trained more quickly than the single large matrix of weights, but this requires some tricky coding.
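A minimal NumPy sketch of the factorization idea (a toy rank-k approximation of one matrix, not the paper's full training scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))  # toy "weight matrix"

# truncated SVD: keep only the k largest singular values
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 3
W1 = U[:, :k] * s[:k]   # 8 x 3 factor
W2 = Vt[:k, :]          # 3 x 6 factor

# the product is the best rank-k approximation of W
W_approx = W1 @ W2
print(W_approx.shape)   # (8, 6)
```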

I speculate that at some point in the future, quantum computing will become commonplace, and when that happens, the need for compressing huge neural networks will go away. But until quantum computing arrives (and I think it will be later rather than sooner), work on compressing neural networks will continue.

The “lottery ticket hypothesis” phrase is catchy and memorable. But if you think about it carefully, the phrase really doesn’t have much to do with the ideas presented in the research paper. But researchers need to market and advertise their work just like anyone else. Here are three examples of product marketing names that didn’t turn out very well. Left: “Terror” brand liquid soap. Center: “Painapple Candy”. Right: “Tastes Like Grandma” jam.

Posted in Machine Learning | Leave a comment

Scott’s Pi for Inter-Rater Reliability

Scott’s pi is one of many classical statistics metrics that can be used to measure how well two raters agree when they rate a set of items. Scott’s pi, like other inter-rater reliability metrics, is used for a very specific problem scenario. I’ll explain by example. Note that assigning a rating is not the same as ranking a set of items from best to worst.

Suppose you have two raters (or judges, or “coders” in classical stats terminology) who rate the quality of life in the 50 states of the U.S. as excellent, good, fair, poor. Your raw data might look like:

# state    rater1      rater2
# -------------------------------
Alabama    fair        good
Alaska     poor        good
. . .
Wisconsin  excellent   excellent
Wyoming    good        fair

The first line of data means rater1 judged Alabama as fair, and rater2 judged Alabama as good. A perfect score for Scott’s pi would be 1.000 if both raters agreed exactly on all 50 states. A Scott pi value close to 0.000 means very little agreement.

To summarize, Scott’s pi is applicable if you have exactly two raters, and a bunch of items that are placed into one of a few discrete categories (“nominal data”) by the raters.

I hadn’t really looked at Scott’s pi since my days as a college professor so I refreshed my memory of Scott’s pi by working an example in Excel. The top matrix holds the raw ratings. For example, the 2 in the first row means that there were 2 states where Rater1 assigned Fair and Rater2 assigned Excellent. There are 50 pairs of ratings, which means there were 50 * 2 = 100 decisions made.

Notice that the entries on the diagonal are the number of times that Rater1 and Rater2 agreed. If there was perfect agreement, all the cells off the diagonal would be 0. The P(observed) is the proportion of agreements that actually happened. It’s calculated as the sum of the values on the diagonal of the raw data, divided by the number of data items (50). For the example, P(observed) = (4 + 6 + 3 + 5) / 50 = 18 / 50 = 0.36. Put another way, the two raters agreed on 36% of the items.

The bottom matrix is used to calculate P(expected), which is the proportion of agreements you’d expect if the ratings were made at random.

The first column (10, 17, 12, 11) holds the totals for each category assigned by Rater1. The second column (11, 16, 13, 10) holds the totals for Rater2. The joint proportion (JP) for a category is the sum of the two rater totals divided by the total number of decisions made (100). For example, the JP for the Excellent category is (10 + 11) / 100 = 0.21. The fourth column holds the squared JP values.

The P(expected) = 0.2596 and it’s calculated as the sum of the squared JP values.

Scott’s pi value compares P(observed) and P(expected) like so:

pi = [P(obs) - P(expected)] / [1 - P(expected)]
   = (0.36 - 0.2596) / (1 - 0.2596)
   = 0.1004 / 0.7404
   = 0.136

The calculation is not obvious at first, but it makes sense if you think about it for a bit. Notice that if there is perfect agreement between the two raters, P(observed) will be 1.00 and, no matter what P(expected) is, pi = (1.00 – any) / (1 – any) = 1.000.

I found an online inter-rater reliability calculator that does Scott’s pi so I used it to verify my Excel example. The hardest part about using the online calculator was setting up the data file in the correct format. I had to encode Excellent = 4, Good = 3, and so on.

Anyway, good fun. When I get some free time, maybe I’ll code up an implementation using Python. It won’t be difficult — if you can work a problem in Excel, you can almost always translate to a Python program quite easily.
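As a sketch of what such a from-scratch Python version might look like, here is a minimal implementation that takes each rater's per-category totals, the number of agreements, and the number of items, using the counts from the worked example above:

```python
def scotts_pi(r1_totals, r2_totals, n_agree, n_items):
  # proportion of items where the two raters agreed
  p_obs = n_agree / n_items
  # expected agreement: sum of the squared joint proportions
  n_decisions = 2 * n_items  # two ratings per item
  p_exp = sum(((a + b) / n_decisions) ** 2
              for a, b in zip(r1_totals, r2_totals))
  return (p_obs - p_exp) / (1.0 - p_exp)

# category totals for Rater1 and Rater2, 18 agreements, 50 states
pi = scotts_pi([10, 17, 12, 11], [11, 16, 13, 10], 18, 50)
print("%0.4f" % pi)  # 0.1356
```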

In general I’m not a fan of pop art style illustrations, but I like these three examples. Left: By artist Chamnan Chongpaiboon. I rate it Excellent. Center: By artist Shreya Bhan. I rate it Good. Right: By artist Michael Eyal. I rate it as Good.

Posted in Miscellaneous | Leave a comment

Knowing When To Stop Training a Generative Adversarial Network (GAN)

A generative adversarial network (GAN) is a deep neural system that is designed to generate fake/synthetic data items. A GAN has a clever architecture made of two neural networks: a generator that creates fake data items, and a discriminator that classifies a data item as fake (0) or real (1). GANs are most often used to generate synthetic images, but GANs can generate any kind of data.

Training a GAN is quite difficult. There are twice as many hyperparameters to deal with (number of hidden layers, number of nodes in each layer, activation function, batch size, optimization algorithm, learning rate, loss function, and so on) as there are for a regular neural network.

Knowing when to stop training a regular neural network is difficult and usually involves looking at the value of the loss function during training. I’ve been experimenting with GANs and wondered how to know when to stop training. Looking at the loss values of the generator and the discriminator won’t really work because the generator is constantly trying to create fake data to fool the discriminator, while at the same time the discriminator is learning how to tell fake data from real data.

I did a simple thought experiment and figured that if the discriminator was able to distinguish fake data items from real data items with about 50% accuracy, then the generator is doing a good job of creating fake data items.

So I coded up an experiment with a function that computes the prediction accuracy of the discriminator, based on n fake data items produced by the generator. I used PyTorch, my neural code library of choice, but the same ideas can be used in Keras or TensorFlow. In pseudo-code:

loop n times
  use generator to create a fake data item
  feed fake data item to discriminator, get result p
  if p < 0.5 then
    n_correct += 1  # determined it was fake
  else
    n_wrong += 1    # thought it was real
end-loop
return n_correct / (n_correct + n_wrong)

The code implementation is:

def Accuracy(gen, dis, n, verbose=False):
  # accuracy of discriminator on n fake images from generator
  n_correct = 0; n_wrong = 0

  for i in range(n):
    zz = T.normal(0.0, 1.0,
      size=(1, gen.inpt_dim)).to(device)  # 20 values
    fake_image = gen(zz)  # one fake image
    pp = dis(fake_image)  # pseudo-prob
    if pp < 0.5:
      n_correct += 1      # discriminator knew it was fake
    else:
      n_wrong += 1        # dis thought it was a real image

    if verbose == True:
      print("pseudo-prob = %0.4f " % pp.item())

  return (n_correct * 1.0) / (n_correct + n_wrong)

I ran the experiment code on a GAN that creates synthetic ‘3’ digits based on the UCI Digits dataset. Each ‘3’ digit is a crude 8×8 grayscale image of a handwritten digit.

The results were quite satisfactory. The accuracy of the discriminator started out near 100% as expected because the generator hadn’t learned to make good fake images yet. As training continued, the accuracy of the discriminator slowly went down as the generator got better.

Anyway, very interesting and good fun.

Animated films are synthetic versions of reality. I like several stop motion animation films, including these three. Left: “Coraline” (2009) is sort of a dark, modern day Alice in Wonderland. Great story, great animation. Center: “Isle of Dogs” (2018) is a fantastically creative story that’s difficult to describe. An amazing film. Right: “James and the Giant Peach” (1996) is an adaptation of a story from the ultra-inventive mind of author Roald Dahl (1916-1990).

Posted in PyTorch | Leave a comment

A Quick Look at Uno Platform Development

The Uno platform is a software library that allows software developers to create a single application that targets Android devices, iOS devices, Windows devices, and Web applications. Put another way, using Uno, a software developer can write one application that will run on . . . just about anything.

Cross-platform software development, in many different forms, has been a goal for decades. The Java programming language touted “write once, run anywhere” in the 1990s. At a lower level, in the 1960s the C programming language was designed so that developers wouldn’t have to write programs using different assembly languages. The Uno platform brings this idea to a high level of abstraction.

One of the ways I try to keep up with new developments in software technologies is to review technical book manuscripts. One of my favorite publishers to review for is Syncfusion. They have published many excellent e-books that are freely available. Recently, I was performing a tech review and edit for a new e-book titled “Uno Succinctly” and I learned a lot about the Uno platform.

While I was preparing to get started on the Uno manuscript review, I found an excellent, in-depth video by Martin Zikmund on YouTube. I captured five images from the presentation for this blog post.

Left: You can develop separate applications for each platform. Right: Libraries like Xamarin allow you to write one set of code for the logic but still require separate sets of code for different platform UI.

The first image illustrates the most basic approach to cross-platform development: create four different versions of an application. Typically the logic code would be written in C#, Kotlin, Objective-C, Swift, or JavaScript. The UI code would be written in XAML, Android XML, SwiftUI, or HTML with CSS. There are, of course, dozens of alternative languages.

The second image shows a partial solution to the cross-platform development problem: use a library such as Xamarin so that a single set of the logic code could be deployed on various devices and OSes. But you’d still have to write different UI code for different devices.

Left: The goal of Uno is one code base for all platforms. Right: An example of an Uno application – all the UIs look very much the same. (cue sound of me yawning).

The third image shows the goal of the Uno platform: one platform that allows developers to write once, run anywhere. Quite an aspiration.

The fourth image shows an application developed using Uno, where the UI on different systems looks nearly the same.

The fifth image below shows an example of Uno development in action. The references shown are just a tiny tip of the iceberg — there are literally thousands of code dependencies involved with Uno development.

I am usually optimistic about technologies, but I’m highly skeptical of any multipurpose tool intended for all cross-platform development. I’ve worked on cross-platform development and systems such as Uno have to be insanely complex. In my experience, I spent far more time debugging the cross-platform development system than I did debugging the application under development. In many situations, a cross-platform development solution creates more problems than it solves. Sometimes the simplest solution is the best — just bite the bullet and write and test and maintain different code bases. Painful, but it always works.

My opinions are strongly influenced by my distaste for writing UI code. I am much more interested in data structures and algorithms than I am in placing a UI button on an application. I am happiest when the code I write will run in a shell.

So, if you are in a situation where cross-platform development is required, you should check out Uno. But if you’re new to cross-platform development, beware: it is a very difficult, tedious, and annoying environment to work in.

Keep an eye out for the free “Uno Succinctly” e-book from the Syncfusion company. It should be published in June or July 2021.

A chatelaine is a type of jewelry that is rarely worn anymore. A chatelaine was worn from a woman’s waist. Originally, chatelaines were purely functional and held keys because Victorian era dresses in the 1800s did not have pockets. Later, various useful items such as thimbles, small scissors, mirrors, watches, and so on were added. Chatelaines were a multipurpose tool in some sense. As time went by, chatelaines became more decorative and jewelry-like and less functional.

Posted in Miscellaneous | Leave a comment

Implementing Kullback-Leibler Divergence from Scratch Using Python

The Kullback-Leibler divergence is a number that is a measure of the difference between two probability distributions. I wrote some machine learning code for work recently and I used a version of a KL function from the Python scipy.stats.entropy code library. That library version of KL is very complex and can handle all kinds of scenarios. But my problem scenario was very basic so I wondered how difficult it would be to implement KL from scratch using Python. It turned out to be very easy.

I reviewed my knowledge of KL by going to the Wikipedia article on the topic. The Wikipedia article was excellent — for me anyway. Whether or not a Wikipedia article on a technical topic is good for a particular person depends entirely on that person’s background knowledge of the topic.

The Wikipedia article on Kullback-Leibler Divergence is excellent.

Anyway, the Wikipedia article gave a worked example of KL. Hooray! (Wikipedia authors — please always give a worked example of an algorithm or metric!) So I decided to replicate that example. The example set up a first distribution with three values: P = (9/25, 12/25, 4/25) and a second distribution: Q = (1/3, 1/3, 1/3). The value of KL(P,Q) = 0.085300 and the value of KL(Q,P) = 0.097455.

Implementing KL from scratch was very easy because I only needed a version for discrete distributions and I didn’t need to do any error checking. For example, if any of the cells of the Q distribution are 0 you’d get a divide by zero exception. Or if the values in either P or Q don’t sum to 1, you’ll get an incorrect result.

Small values of KL indicate the two distributions are similar, and larger values indicate greater difference. If KL = 0 the two distributions are the same. KL is not symmetric so KL(P,Q) != KL(Q,P) in general. To get rid of this minor annoyance, you can compute KL in both directions and then either sum, or take the average.
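For example, a minimal sketch of symmetrizing KL by averaging the two directions, using the Wikipedia example distributions:

```python
import numpy as np

def KL(p, q):
  # discrete KL divergence "from q to p"
  return float(np.sum(p * np.log(p / q)))

p = np.array([9/25, 12/25, 4/25])
q = np.array([1/3, 1/3, 1/3])
sym_kl = 0.5 * (KL(p, q) + KL(q, p))  # symmetric version
print("%0.6f" % sym_kl)  # average of 0.085300 and 0.097455
```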

I coded up KL from scratch and ran it with the Wikipedia example data and got the same results.

There’s a trade-off between using a function from a code library and implementing the function from scratch. I prefer implementing from scratch when feasible. Early in my career as a software developer, I worked on complex projects that used C++ with COM technology. We used many code libraries. Dependency problems, including DLL Hell, were an absolute nightmare. The more things you implement from scratch, the fewer dependencies you have. So if a function can be implemented from scratch quickly and reliably, I’ll use that strategy.

Sometimes doing things yourself is not a good strategy. Left: TV on-off switch repair. Center: New electrical outlet added — questionable location. Right: One way to prevent a circuit breaker from tripping.

# Kullback-Leibler from scratch

import numpy as np

def KL(p, q):
  # "from q to p"
  # p and q are np array frequency distributions

  n = len(p)
  sum = 0.0
  for i in range(n):
    sum += p[i] * np.log(p[i] / q[i])
  return sum

def main():
  print("\nBegin Kullback-Leibler from scratch demo ")
  np.set_printoptions(precision=4, suppress=True)

  p = np.array([9.0/25.0, 12.0/25.0, 4.0/25.0], dtype=np.float32)
  q = np.array([1.0/3.0, 1.0/3.0, 1.0/3.0], dtype=np.float32)

  print("\nThe P distribution is: "); print(p)
  print("The Q distribution is: "); print(q)

  kl_pq = KL(p,q)
  kl_qp = KL(q, p)

  print("\nKL(P,Q) = %0.6f " % kl_pq)
  print("KL(Q,P) = %0.6f " % kl_qp)

  print("\nEnd demo ")

if __name__ == "__main__":
  main()

Posted in Machine Learning | Leave a comment

Positive and Unlabeled Learning (PUL) Using PyTorch

I wrote an article titled “Positive and Unlabeled Learning (PUL) Using PyTorch” in the May 2021 edition of the online Microsoft Visual Studio Magazine.

A positive and unlabeled learning (PUL) problem occurs when a machine learning set of training data has only a few positive labeled items and many unlabeled items. For example, suppose you want to train a machine learning model to predict if a hospital patient has a disease or not, based on predictor variables such as age, blood pressure, and so on. The training data might have a few dozen instances of items that are positive (class 1 = patient has disease) and many hundreds or thousands of instances of data items that are unlabeled and so could be either class 1 = patient has disease, or class 0 = patient does not have disease.

The goal of PUL is to use the information contained in the dataset to guess the true labels of the unlabeled data items. After the class labels of some of the unlabeled items have been guessed, the resulting labeled dataset can be used to train a binary classification model using any standard machine learning technique, such as k-nearest neighbors classification, neural binary classification, logistic regression classification, naive Bayes classification, and so on.

PUL is challenging and there are several techniques to tackle such problems. I created a synthetic dataset of Employee information where the goal is to predict if an employee is an introvert (0) or an extrovert (1). There are 200 data (employee) items. Only 20 are labeled as positive = extrovert = 1. The other 180 data items are unlabeled and could be positive, or negative = introvert = 0.

The demo program repeatedly (eight times) trains a helper binary classifier using the 20 positive employee data items and 20 randomly selected unlabeled items which are temporarily treated as negative. Expressed in pseudo-code:

create a 40-item train dataset with all 20 positive
  and 20 randomly selected unlabeled items that
  are temporarily treated as negative
loop several times
  train a binary classifier using the 40-item train data
  use trained model to score the 160 unused unlabeled
    data items
  accumulate the p-score for each unused unlabeled item
  generate a new train dataset with the 20 positive
    and 20 different unlabeled items treated as negative

for-each of the 180 unlabeled items
  compute the average p-value

  if avg p-value > hi threshold
    guess its label as positive
  else-if avg p-value < lo threshold
    guess its label as negative
  else
    insufficient evidence to make a guess

Somewhat unexpectedly, the most difficult part of a PUL system is wrangling the data to generate dynamic (changing) training datasets. The challenge is to be able to create an initial training dataset with the 20 positive items and 20 randomly selected unlabeled items like so:

train_file = ".\\Data\\employee_pul_200.txt"
train_ds = EmployeeDataset(train_file, 20, 180)

And then inside a loop, be able to reinitialize the training dataset with the same 20 positive items but 20 different randomly selected unlabeled items.
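One possible design, sketched here with hypothetical names (not the article's actual EmployeeDataset code): keep the 20 positive indices fixed and resample the 20 pseudo-negative indices on each pass:

```python
import numpy as np

class DynamicPUIndices:
  # hypothetical sketch: items 0-19 are positive, 20-199 unlabeled
  def __init__(self, n_pos=20, n_unlabeled=180, seed=0):
    self.pos_idx = np.arange(n_pos)  # always in the train set
    self.unl_idx = np.arange(n_pos, n_pos + n_unlabeled)
    self.rng = np.random.default_rng(seed)
    self.resample()

  def resample(self):
    # pick 20 unlabeled items to temporarily treat as negative
    self.neg_idx = self.rng.choice(self.unl_idx, 20, replace=False)
    self.train_idx = np.concatenate([self.pos_idx, self.neg_idx])

ds = DynamicPUIndices()
print(len(ds.train_idx))  # 40 (20 positive + 20 pseudo-negative)
ds.resample()             # new pass: different pseudo-negatives
```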


Because the PUL guessing process is probabilistic, there are many approaches you can use. The technique presented in the article is based on a 2013 research paper by F. Mordelet and J.P. Vert, titled “A Bagging SVM to Learn from Positive and Unlabeled Examples”. That paper uses an SVM (support vector machine) binary classifier to analyze unlabeled data. The article uses a neural binary classifier instead. The approach presented in the article is new and mostly unexplored.

Here are three old devices and descriptions of what they did. All three devices are real, but only one description is true; the other two descriptions I made up and are completely false. Can you guess which is the positive label/description? Answer below.

Left: This is a woman wearing a “brank”, a device husbands put on wives who talked too much. It had a bit similar to the one on a horse bridle.
Center: This is a device called a “Johnson passer” used in London to put a divider down the middle of a street. The person riding in the cab released small white pebbles through the tube in the back.
Right: This is a device used by fishermen called a “bobblet”. They would stick the bill into the water and blow, creating bubbles that would attract crabs and eels, which were then speared.

Posted in Machine Learning, PyTorch | Leave a comment

Simple Ordinal Classification Using PyTorch

I was chatting with some of my colleagues at work about the topic of ordinal classification, also known as ordinal regression. An ordinal classification problem is a multi-class classification problem where the class labels to predict are ordered, for example, “poor”, “average”, “good”.

The problem scenario is best explained by example. Suppose you want to predict the price of a house, where a house price is an ordinal value (0 = low, 1 = medium, 2 = high, 3 = very high) rather than a numeric value such as $525,000. There are dozens of rather complicated old machine learning techniques for ordinal classification that are based on logistic regression. But using a neural network approach is easy and effective. I wrote a demo program using PyTorch to demonstrate.

Continuing with the ordinal house price example, you define a neural network that has one output node. You use logistic sigmoid activation on the output node so that a computed output value is between 0.0 and 1.0. Then:

output values between 0.00 and 0.25 correspond to class 0 (low price)
output values between 0.25 and 0.50 correspond to class 1 (medium)
output values between 0.50 and 0.75 correspond to class 2 (high)
output values between 0.75 and 1.00 correspond to class 3 (very_high)

Now to train the network: if a training item (a house) is class 0, you want to define a loss function so that the network adjusts its weights to make the computed output close to the center of the class 0 range. This is halfway between 0.00 and 0.25 = 0.125. Similarly:

training label    output node target
     0              0.125
     1              0.375
     2              0.625
     3              0.875

If there are k=4 ordinal classes, and if a training item has class 0 as a target, the computed output of the neural network should be 0.125. And so on.

Therefore the neural network loss function compares the target value from the training data (0 to 3) with the values in the table above. I used mean squared difference. For this example, the number of ordinal classes is k = 4. If t is the target label (0 to 3), the “output node target” values are computed as (2 * t + 1) / (2 * k). For example, if t = 3, then (2 * t + 1) / (2 * k) = (2 * 3 + 1) / (2 * 4) = 7/8 = 0.875 as shown.

I implemented this idea for ordinal classification loss like so:

def ordinal_loss(output, target, k):
  # loss = T.mean((output - target)**2)  # plain MSE
  loss = T.mean((output - ((2 * target + 1) / (2 * k)))**2)
  return loss

For a specific problem, the number of class labels will be fixed, so you could just hard-code the target values in an array in the loss function, such as targets = np.array([0.125, 0.375, 0.625, 0.875]).
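A quick self-contained usage sketch with made-up values (three computed outputs, target classes 0, 1, 2, and k = 4, assuming torch is imported as T):

```python
import torch as T

def ordinal_loss(output, target, k):
  # map target class t to (2t+1)/(2k), then mean squared error
  return T.mean((output - ((2 * target + 1) / (2 * k)))**2)

# made-up computed outputs and integer class targets
output = T.tensor([0.10, 0.40, 0.70])
target = T.tensor([0.0, 1.0, 2.0])
loss = ordinal_loss(output, target, k=4)
print("%0.6f" % loss.item())  # targets map to 0.125, 0.375, 0.625
```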

My demo program used 200 synthetic data items for training. The data looks like:

AC    sq. feet  style     price   school
-1    0.3075    1  0  0    3      0  1  0
-1    0.2700    1  0  0    2      0  0  1
 1    0.1700    0  1  0    1      0  0  1
-1    0.1475    1  0  0    1      1  0  0
 1    0.2000    1  0  0    2      1  0  0
-1    0.1100    0  0  1    0      1  0  0
. . .

The predictors are air conditioning (-1 = no, 1 = yes), area in square feet (normalized), style (art_deco, bungalow, colonial), and local elementary school (johnson, kennedy, lincoln).

My results were very good.

However, I have some questions in my mind. I spent quite a bit of time searching the Internet for “ordinal regression” and “ordinal classification” and found all kinds of very complicated techniques, but I didn’t find the idea I used for my demo. This idea was the very first thing that popped into my head, and it’s very obvious. I don’t know why I didn’t find any information about this technique — I thought that someone surely must have investigated the idea.

So, there are three possibilities. First, my idea for ordinal classification could have some fatal logic flaw I’m not seeing and I just got very lucky with my demo. Second, maybe nobody has ever tried my idea before because it requires creating a custom neural loss function, which sounds scary (but isn’t). Third, perhaps the technique has been tried and is well known, but is called by some special name and so I didn’t find it during my Internet research.

I’ll continue exploring ordinal classification to see if I can solve the mysterious situation.

There were a lot of Mysterious movies in the 1930s and 40s.

Left: “Mysterious Mr. Moto” (1938) features a clever Japanese detective played by actor Peter Lorre. Moto infiltrates a gang of assassins to stop an evil plot.

Left Center: “The Mysterious Miss X” (1939) is a story about two out-of-work actors who are mistaken for detectives. They solve the murder of a rich businessman (it was the lawyer) and one finds romance with the dead man’s daughter.

Right Center: “The Mysterious Dr. Fu Manchu” (1929) tells the origin story of the evil Chinese mastermind (played by Warner Oland, who later played detective Charlie Chan throughout the 1930s). Fu Manchu attempts to murder the people he believes are responsible for his wife’s death. He is thwarted by Scotland Yard Inspector Nayland Smith and his assistant Dr. Jack Petrie.

Right: “The Mysterious Mr. M” (1946) takes place after WWII when criminal Anthony Waldron has developed a mind-control drug and he intends to use it to steal plans for a submarine engine. A mysterious villain named Mr. M appears and muscles in on the action. Agent Grant Farrell eventually stops both evil Waldron and evil Mr. M, who turns out to be Waldron’s sister.

Posted in PyTorch | Leave a comment

Logistic Regression Using PyTorch With L-BFGS Optimization

The PyTorch code library was designed to enable the creation of deep neural networks. But you can use PyTorch to create simple logistic regression models too. Logistic regression models predict one of two possible discrete values, such as the sex of a person (male or female).

Training a neural network is the process of finding good values for the weights and biases, which are constants like -1.2345, that define the behavior of the network. By far the most common way to train a neural network is to use stochastic gradient descent combined with either MSE (mean squared error) or BCE (binary cross entropy) loss. If you create a logistic regression model using PyTorch, you can treat the model as a highly simplified neural network and train the logistic regression model using stochastic gradient descent (SGD). But it’s also possible to train a PyTorch logistic regression model using an old technique called L-BFGS.

The advantages of SGD are that it works with simple or complex neural architectures and that it can train in batches, which allows very large datasets. The disadvantage is that SGD requires tuning the learning rate and batch size parameters, which can be difficult and time consuming.

The advantages of L-BFGS are that it converges in very few iterations, and so is blazingly fast, and that parameter tuning is usually not necessary. The disadvantage is that all data must be stored in memory, so L-BFGS doesn’t work with very large datasets (there are some complex work-arounds to this however).

I set out to extend my knowledge of PyTorch by creating a logistic regression model and training it using L-BFGS. There are several differences between using SGD and using L-BFGS. The most important difference is that to use L-BFGS you must define a closure() function. Loosely speaking, a closure() function is a function defined inside another function. The closure() function computes the loss and is used by L-BFGS to update model weights and biases. It would have taken me many hours to figure this out by myself but luckily the PyTorch documentation had an example code fragment that put me on the right path.

I wrote a demo program. Here is the key code that trains the logistic regression model:

def train(log_reg, ds, bs, mi):
  # model, dataset, batch_size (must be all data), max iterations
  loss_func = T.nn.BCELoss()  # binary cross entropy
  opt = T.optim.LBFGS(log_reg.parameters(), max_iter=mi)
  train_ldr =,
    batch_size=bs, shuffle=False)  # shuffle irrelevant

  print("\nStarting L-BFGS training")

  for itr in range(0, mi):
    itr_loss = 0.0            # for one iteration
    for (_, all_data) in enumerate(train_ldr):  # b_ix irrelevant
      X = all_data['predictors']  # all inputs
      Y = all_data['sex']         # all targets

      # -------------------------------------------
      def closure():
        oupt = log_reg(X)
        loss_val = loss_func(oupt, Y)
        return loss_val
      # -------------------------------------------

      opt.step(closure)      # L-BFGS calls closure(), updates wts
      loss_val = closure()   # re-evaluate loss after update, to monitor
      itr_loss += loss_val.item()
    print("iteration = %4d   loss = %0.4f" % (itr, itr_loss))

  print("Done ")

There is a lot going on here. L-BFGS uses gradients but in a different way from SGD and so you don’t have to deal with setting the eval() and train() modes. There are other differences too, so if you want to use L-BFGS yourself, be prepared to spend a few hours with the PyTorch documentation.

Naming the local function closure() isn’t very descriptive — perhaps loss_closure() would be better — but the PyTorch documentation used “closure()” so I used that name too.

My demo program creates a model that predicts the sex of a hospital patient based on their age, county of residence (one of three), blood monocyte count, and hospitalization history (minor, moderate, major). The prediction accuracy results of a model trained with L-BFGS were about the same as the best results I got on the model trained using SGD, but I had to spend quite some time tuning the SGD-trained model whereas the model trained using L-BFGS gave pretty good results immediately.

My conclusion: In scenarios where you create a logistic regression model using PyTorch, if your training data can fit into memory, using L-BFGS instead of SGD is a good approach. There are many small differences when using L-BFGS. For example, because each batch contains the entire training dataset, the shuffle parameter in DataLoader has no effect and can be left as False.

Left: The shuffle dance is a joyful style that was invented in Australia. It reminds me a bit of Irish clog dancing. Here are two girls who shuffle dance up a set of stairs in unison. Very cool.

Center: Shuffle Master machines dominate the automatic card shuffling market. I’ve seen the inner workings of these machines and they’re quite remarkable.

Right: When I worked on a cruise ship as an Assistant Cruise Director years ago, one of my duties was to organize and referee the daily shuffleboard tournament. It was very popular with passengers. I’m wearing the flashy red pants and concentrating — the participants took the game seriously and were often very competitive (in a good way).

Code below (very long).

Posted in PyTorch | Leave a comment

Combining Two Different Logistic Regression Models by Averaging Their Weights

I was in a meeting recently and one of my colleagues briefly described some work he had done at a previous job. He had an enormous set of training data and wanted to train a logistic regression model.

Logistic regression is a binary classification technique and is one of the simplest forms of machine learning. Suppose you want to predict if a person is male (class 0) or female (class 1) based on age, income, and height. When you train a LR model you will get one weight for each predictor variable and one bias. You multiply each predictor value times its weight, add the bias, then apply the logistic sigmoid function. The result will be a pseudo-probability value between 0 and 1. If the p-value is less than 0.5 the prediction is class 0, and if the p-value is greater than 0.5 the prediction is class 1.

For example, suppose age = 0.29, income = 0.5400, height = 0.72 and the weights are w1 = 1.7, w2 = -1.4, w3 = -0.5, and the bias is 0.2 then:

z = (0.29 * 1.7) + (0.5400 * -1.4) + (0.72 * -0.5) + 0.2
  = -0.4230

p = sigmoid(-0.4230)
  = 1.0 / (1.0 + exp(-z))
  = 1.0 / (1.0 + exp(0.4230))
  = 0.3958

  = class 0
  = male
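The arithmetic above can be verified with a few lines of Python (the variable names are mine, not part of any library):

```python
import math

def sigmoid(z):
  # logistic sigmoid: 1 / (1 + e^(-z))
  return 1.0 / (1.0 + math.exp(-z))

# age = 0.29, income = 0.5400, height = 0.72 with
# weights w1 = 1.7, w2 = -1.4, w3 = -0.5 and bias = 0.2
z = (0.29 * 1.7) + (0.5400 * -1.4) + (0.72 * -0.5) + 0.2
p = sigmoid(z)            # z = -0.4230, p = 0.3958
label = 0 if p < 0.5 else 1  # 0 = male, 1 = female
```

Running this gives z = -0.4230 and p = 0.3958, so the predicted class is 0 (male), matching the hand computation.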

There are many algorithms you can use to find the weights and bias for logistic regression. Common techniques include stochastic gradient descent, L-BFGS, Nelder-Mead, and iterated Newton-Raphson. The implementations of all these techniques generally assume that the training data is small enough to fit into memory (perhaps a million or so items or less).

If you have a huge set of training data that won’t fit into memory, then you have a problem. One approach is to write data-loading code that will stream training data into memory as needed. Another approach is to break the huge file down to several smaller files, train a logistic regression model on each smaller file, and then combine the separate prediction models in some way.

One way to combine separate models is to maintain separate models and use a voting scheme. Another approach is to maintain separate models and average the p-values from each model.
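The p-value averaging scheme is tiny to sketch. Here the two “models” are stand-in functions that return a pseudo-probability, not real trained models:

```python
# minimal sketch of a p-value averaging ensemble, assuming each
# sub-model is callable and returns a pseudo-probability in [0, 1]
def ensemble_p(models, x):
  ps = [m(x) for m in models]       # p-value from each sub-model
  return sum(ps) / len(ps)          # averaged p-value

# two hypothetical sub-models that disagree on some item x
m1 = lambda x: 0.30   # leans toward class 0
m2 = lambda x: 0.80   # leans toward class 1
p = ensemble_p([m1, m2], None)      # 0.55, so predicted class = 1
```

Notice that the averaged p-value can flip the decision relative to one of the sub-models, which is the whole point of the ensemble.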

Yet another approach is to create a meta-model that accepts the p-values from the sub-models and then combines them, typically by using logistic regression again.

Anyway, now I’m finally getting to the point: I think my colleague trained separate models and then combined them into a single model by averaging the models’ weights and biases.

I was intrigued, so the next day I searched the Internet looking for any research on combining logistic regression models by averaging weights and biases. I didn’t find any solid evidence (in my opinion anyway). I did find several opinions on sites like Stack Overflow, but I know from previous experience that many machine learning opinions are completely wrong.

Left: Two logistic regression models trained on 100-item datasets, and then a combined logistic regression model created by using the average weights and biases of the two small models. Right: A single logistic regression model trained on all 200 data items seemed to work better.

So, I set out to run an experiment to try to gain some insight. Bottom line: the technique of combining logistic regression models by averaging weights and biases did not work well in my one experiment, but the results were not conclusive.

For my experiment, I used the PyTorch neural code library. I started with 200 synthetic patient data items for training and 40 for testing. Each item had a patient sex (male = 0, female = 1), age, county (one of three), monocyte count, and hospitalization history (minor, moderate, major). The goal is to predict sex from the other variables. The data looks like:

1	0.58	0	1	0	0.6540	0	0	1
0	0.39	0	0	1	0.5120	0	1	0
1	0.24	1	0	0	0.2950	0	0	1
0	0.31	0	1	0	0.4640	1	0	0
. . .

For the combined logistic regression model, I divided the 200 training items into two 100-item sets. I trained a first logistic regression model on the first set of data, then trained a second model on the second set of data. After training, I created a third logistic regression model and set its weights and bias to the averages of the two separate models. I applied the combined model to the test data and got 62.5% accuracy.
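In PyTorch, averaging the weights and biases of two models takes only a few lines. This is a minimal sketch, not the demo code; the class name, the average_models() helper, and the 8-input size are my own illustrative choices:

```python
import torch as T

# a logistic regression model as a single Linear layer plus sigmoid
class LogisticReg(T.nn.Module):
  def __init__(self, n_in):
    super().__init__()
    self.fc = T.nn.Linear(n_in, 1)
  def forward(self, x):
    return T.sigmoid(self.fc(x))

def average_models(m_a, m_b, n_in):
  # build a third model whose parameters are the element-wise
  # averages of the two source models' parameters
  combined = LogisticReg(n_in)
  sd_a, sd_b = m_a.state_dict(), m_b.state_dict()
  sd_avg = {key: (sd_a[key] + sd_b[key]) / 2.0 for key in sd_a}
  combined.load_state_dict(sd_avg)
  return combined

model_a = LogisticReg(8)   # stand-ins for the two trained models
model_b = LogisticReg(8)
combined = average_models(model_a, model_b, 8)
```

The state_dict() / load_state_dict() round trip keeps the parameter names aligned, so the same idea works for any two models with identical architectures.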

To check this, I created a single logistic regression model and trained it on all 200 data items. When I applied this model to the test data, it achieved 75% accuracy — quite a bit better.

I think I understand why the averaged weights combined model doesn’t work well. When you train a logistic regression model, somewhat surprisingly, there are many different sets of weights and bias that will give you very similar answers. A large value in one weight can be balanced by moderate values in two other weights. So when you train separate models, you will get different sets of weights which work well on their own small dataset, but averaging the weights gives mushy weights that work OK but not well.

As always, there are dozens of factors that could be confounding my experiment. However, if I were forced to create a logistic regression model using a huge set of training data tomorrow, my first choice would be to write a streaming data loader, and my second choice would be to create separate models but average the models’ output p-values.

Two things that I’d like to explore are 1.) using a much larger dataset, 2.) using a more consistent training algorithm — L-BFGS instead of stochastic gradient descent.

Interesting stuff.

Mixed media combines different techniques to produce a single work of art. When the technique succeeds, it can create art that’s more appealing than the separate techniques. Here are three mixed media portraits by artists I like. Left: By Graeme Stevenson. Center: By Hans Jochem Bakker. Right: By Andrea Matus Demeng.

Posted in PyTorch | Leave a comment