The PyTorch log_softmax() Function

Working with deep neural networks in PyTorch or any other library is difficult for several reasons. One reason is that there are a huge number of low-level details. For example, when creating a multi-class classifier you have two common design options (there are many less-common options too). Option #1: Use log_softmax() activation on the output nodes in conjunction with NLLLoss() when training ("negative log-likelihood loss"). Option #2: Use no activation on the output nodes (or equivalently, identity() activation) in conjunction with CrossEntropyLoss() when training.

I give fairly detailed examples of the two approaches at https://jamesmccaffrey.wordpress.com/2020/06/11/pytorch-crossentropyloss-vs-nllloss-cross-entropy-loss-vs-negative-log-likelihood-loss/.



# log_soft_demo.py
# Python 3.7.6 (Anaconda3-2020.02)
# PyTorch 1.6.0  Windows 10

import torch as T
device = T.device("cpu")

print("\nBegin softmax and log_softmax() demo \n")

t1 = T.tensor([1.0, 3.0, 2.0], dtype=T.float32).to(device)
sm = T.nn.functional.softmax(t1, dim=0)
lsm = T.nn.functional.log_softmax(t1, dim=0)
l_sm = T.log(T.nn.functional.softmax(t1, dim=0))

T.set_printoptions(precision=4)
print("tensor t1        = ", end=""); print(t1)
print("softmax(t1)      = ", end=""); print(sm)
print("log_softmax(t1)  = ", end=""); print(lsm)
print("log(softmax(t1)) = ", end=""); print(l_sm)

print("\nEnd demo ")

I computed softmax() and log_softmax() and log(softmax) of [1.0, 3.0, 2.0] using Excel, and then again using PyTorch.


Now on the one hand, this is all the information that is needed to implement a PyTorch multi-class classifier. But behind the scenes there are many details. These details can be confusing if you have a semi-theoretical knowledge of neural networks. For example, what about softmax() activation on the output nodes? Briefly, in theory you want to apply softmax() to the raw output node values (called "logits") so that the sum of the output node values is 1.0 and the values can be loosely interpreted as probabilities. Then you compare the pseudo-probabilities with the target output values. For example, a target output might be (0, 0, 1, 0) and the softmax computed output might be (0.1, 0.2, 0.6, 0.1). The differences between computed outputs and target outputs are then used to adjust the network weights so that the computed output values get better.

But PyTorch examples usually don't use this approach. It turns out that computing softmax() is astonishingly difficult if you want to avoid arithmetic underflow or overflow. (Believe me, I've tried.) So, for the sake of engineering, PyTorch uses log_softmax(), which significantly reduces the likelihood of arithmetic overflow (but unfortunately is still susceptible to underflow).
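To see the overflow problem concretely, here is a minimal sketch of the standard max-subtraction trick for computing log_softmax() safely. This is my own illustration of the idea, not PyTorch's actual implementation:

```python
# Sketch of the max-subtraction ("log-sum-exp") trick. Computing
# math.exp(1000.0) overflows, but after shifting every logit by the
# max logit, every exp() argument is <= 0 and cannot overflow.
import math

def stable_log_softmax(logits):
    m = max(logits)                        # largest logit
    shifted = [x - m for x in logits]      # all values are now <= 0
    log_sum = math.log(sum(math.exp(x) for x in shifted))
    return [x - log_sum for x in shifted]

lsm = stable_log_softmax([1000.0, 1002.0, 1001.0])  # no overflow
```

A naive softmax of (1000.0, 1002.0, 1001.0) would raise an OverflowError in plain Python; the shifted version works fine.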

Somewhat unfortunately, the name of the PyTorch CrossEntropyLoss() is misleading because in mathematics, a cross entropy loss function would expect input values that sum to 1.0 (i.e., after softmax()'ing), but the PyTorch CrossEntropyLoss() function expects raw logits and applies log_softmax() to them internally. It is NLLLoss() that expects inputs which have already had log_softmax() applied.

Put another way: computing softmax is error-prone. Computing log_softmax is less error-prone. Therefore PyTorch usually uses log_softmax, but this means you need the special NLLLoss() function. Because of this confusion, PyTorch combines the techniques into no activation plus CrossEntropyLoss(), which turns out to be even more confusing for beginners.
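The two design options really are interchangeable. Here's a quick check (my own sketch, not from the post) that log_softmax() plus NLLLoss() gives exactly the same loss value as no activation plus CrossEntropyLoss():

```python
# Verify that Option #1 (log_softmax + NLLLoss) and Option #2
# (raw logits + CrossEntropyLoss) produce identical loss values.
import torch as T

logits = T.tensor([[2.0, -1.0, 3.0]])       # raw output node values
target = T.tensor([2], dtype=T.int64)       # correct class index

# Option #1: log_softmax() activation then NLLLoss()
lsm = T.nn.functional.log_softmax(logits, dim=1)
loss1 = T.nn.NLLLoss()(lsm, target)

# Option #2: no activation then CrossEntropyLoss()
loss2 = T.nn.CrossEntropyLoss()(logits, target)
```

The two loss values match to within floating point precision.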

Details, details, details. But interesting, interesting, interesting.


An artificial neural network is a crude approximation of biological neurons. Both real neurons and artificial neurons have a lot of interesting detail. If you’ve ever looked at a bird feather closely, you’ll have noticed the incredible amount of tiny details it has. Left: Real feather earrings on actress Tia Carrere. Center: Real feather earrings on actress Patricia Velasquez. Right: Artificial feather earrings on actress Angelina Jolie. Both the real and the artificial feathers are very interesting to me because of the detail.

Posted in PyTorch | Leave a comment

PyTorch Binary Classification Using the Multi-Class Approach

The goal of a binary classification problem is to make a prediction where the result can be one of just two possibilities, for example predicting if a banknote is authentic or a forgery. All the neural network code libraries I use, including my library of choice PyTorch, treat binary classification problems and multi-class classification problems as two quite different types of problems. This always seemed a bit strange to me. I sat down one day and wondered if I could implement a binary classification problem using a multi-class classification approach. Answer: yes, and the results were identical to the binary classification approach.

For a multi-class classification problem, you create a neural network that has the same number of output nodes as there are classes to predict. For example, if you are trying to predict a person's political leaning of (conservative, moderate, liberal) based on things like age and income, you'd design a neural network with 3 output nodes. The target output is type int64 and is a class index, such as 1, which corresponds to a one-hot encoding of (0, 1, 0). The output layer would use no activation because for training you use CrossEntropyLoss(), which applies log_softmax automatically. The computed output is three values such as (2.345, -1.987, 4.5678) and the predicted class is the index of the largest output value, [2] in this case.
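A minimal sketch of the multi-class plumbing (my illustration, using the example logit values mentioned above):

```python
# Multi-class setup: raw logits, int64 class-index target,
# CrossEntropyLoss for training, argmax for the predicted class.
import torch as T

logits = T.tensor([[2.345, -1.987, 4.5678]])  # raw output node values
target = T.tensor([2], dtype=T.int64)         # class 2, i.e. (0, 0, 1)

loss = T.nn.CrossEntropyLoss()(logits, target)
pred_class = T.argmax(logits, dim=1)          # index of largest logit
```

Here pred_class is 2 because 4.5678 is the largest of the three logits.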



For a binary classification problem, you create a neural network that has one output node. The output is type float32. The output layer would use logistic-sigmoid activation so the computed output is between 0 and 1. For training you use BCELoss() ("binary cross entropy loss"), which requires computed output values between 0 and 1 and which does not apply sigmoid automatically. The computed output is a single value such as 0.345. If the computed output is less than 0.5 the predicted class is 0; if the computed output is greater than 0.5 the predicted class is 1.
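A minimal sketch of the binary plumbing (my illustration; the raw output value is made up to land near the 0.345 example above):

```python
# Binary setup: one raw output value, sigmoid activation to get a
# value in (0, 1), BCELoss for training, 0.5 threshold for the
# predicted class.
import torch as T

raw_out = T.tensor([[-0.641]])            # single raw output value
p = T.sigmoid(raw_out)                    # about 0.345
target = T.tensor([[0.0]])                # correct class as float32

loss = T.nn.BCELoss()(p, target)          # BCELoss needs (0,1) inputs
pred_class = 1 if p.item() > 0.5 else 0
```

Note that BCEWithLogitsLoss() exists too; it accepts the raw output directly and applies the sigmoid internally.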

I was pretty sure I could create a binary classifier using the multi-class approach. I created a network with two output nodes and no output activation. For training I used CrossEntropyLoss() and so log_softmax is automatically applied during training.

Interestingly, for the dataset I experimented with (the Banknote authentication dataset) I got essentially identical results using the normal binary classification technique and using the modified multi-class classification approach.
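The identical results are not a coincidence. A quick check of mine (not from the post) shows that cross entropy on two logits is exactly binary cross entropy on the difference of the logits, because softmax of two values collapses to a sigmoid:

```python
# With two logits (l0, l1), CrossEntropyLoss equals
# BCEWithLogitsLoss applied to the single logit (l1 - l0),
# since softmax([l0, l1])[1] == sigmoid(l1 - l0).
import torch as T

l0, l1 = 0.4, 2.1                               # two output logits
ce = T.nn.CrossEntropyLoss()(T.tensor([[l0, l1]]),
                             T.tensor([1], dtype=T.int64))
bce = T.nn.BCEWithLogitsLoss()(T.tensor([l1 - l0]),
                               T.tensor([1.0]))
```

The two loss values are equal, so the two-output multi-class network is computing the same quantity as the standard one-output binary network.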

Good experiment.


There are many binary pairs. Good vs. evil. Virtue vs. sin. “Dr. Yen Sin” was an early pulp science fiction magazine. It ran for only three issues in 1936. Left: ‘The Mystery of the Dragon’s Shadow” was the featured story in Issue #1. Center: “The Mystery of the Golden Skull” was featured in Issue #2. Right: “The Mystery of the Singing Mummies” headlined the final Issue #3. It seems odd to me to base a magazine on a villain rather than a hero, but a good, evil villain is usually more interesting than a hero.

Posted in PyTorch | Leave a comment

NFL 2020 Week 4 Predictions – Zoltar Hopes To Hit His Stride

Zoltar is my NFL prediction computer program. It uses a deep neural network and reinforcement learning. Typically, Zoltar has best results in weeks 4 through 12. Here are Zoltar’s predictions for week #4 of the 2020 NFL season:

Zoltar:        jets  by    3  dog =     broncos    Vegas:     broncos  by  2.5
Zoltar:      ravens  by    9  dog =    redskins    Vegas:      ravens  by 13.5
Zoltar:      texans  by    4  dog =     vikings    Vegas:      texans  by    4
Zoltar:    seahawks  by    4  dog =    dolphins    Vegas:    seahawks  by    7
Zoltar:       bears  by    6  dog =       colts    Vegas:       colts  by  2.5
Zoltar:      titans  by    5  dog =    steelers    Vegas:      titans  by  1.5
Zoltar:     jaguars  by    0  dog =     bengals    Vegas:     bengals  by    3
Zoltar:  buccaneers  by    6  dog =    chargers    Vegas:  buccaneers  by    7
Zoltar:      saints  by    5  dog =       lions    Vegas:      saints  by  5.5
Zoltar:     cowboys  by    5  dog =      browns    Vegas:     cowboys  by    5
Zoltar:   cardinals  by    0  dog =    panthers    Vegas:   cardinals  by    4
Zoltar:        rams  by   10  dog =      giants    Vegas:        rams  by 11.5
Zoltar:      chiefs  by    6  dog =    patriots    Vegas:      chiefs  by    7
Zoltar:       bills  by    0  dog =     raiders    Vegas:       bills  by  2.5
Zoltar: fortyniners  by    8  dog =      eagles    Vegas: fortyniners  by    6
Zoltar:     packers  by   11  dog =     falcons    Vegas:     packers  by    6

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. For week #4 Zoltar has six hypothetical suggestions.

The teams that Zoltar likes in week #4 are:

1. Underdog NY Jets against the Broncos
2. Underdog Redskins against the Ravens
3. Underdog Bears against the Colts
4. Favorite Titans over the Steelers
5. Underdog Panthers against the Cardinals
6. Favorite Packers over the Falcons

Note: From my human perspective, these predictions look terrible. The Jets are a very bad team. The Redskins are a very bad team and the Ravens are a good team. The Bears have been lucky so far. The Titans have been lucky so far. The Panthers have some key injuries. The Falcons have been very unlucky so far. I would never bet my own real money on these suggestions, except maybe the Packers. But we'll see. Zoltar is dispassionate and doesn't fully understand "lucky" (except to the extent that he takes blowout wins into account).

When you bet on an underdog, your bet pays off if the underdog wins by any score, or if the game is a tie, or if the favorite team wins but by less than the Vegas point spread. If the favorite team wins by exactly the point spread, the bet is a push. You lose your bet only if the favorite wins by more than the Vegas point spread.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
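The 53% figure comes from simple break-even arithmetic. A quick check of my own:

```python
# Break-even win probability p satisfies: p * 100 = (1 - p) * 110,
# which gives p = 110 / 210, about 0.5238, hence "53% or better".
bet, win = 110.0, 100.0
p_breakeven = bet / (bet + win)     # expected profit is zero here
expected_profit = p_breakeven * win - (1.0 - p_breakeven) * bet
```

At exactly p = 110/210 the expected profit per bet is zero; anything above that is (theoretically) profitable.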

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn't useful except for parlay betting.

Zoltar was weak in week #3. Against the Vegas point spread, Zoltar was 2-3. For the season, Zoltar is 12-8 (60.0%) against the spread. Just predicting winners, Zoltar was 11-4 which is pretty good. (There was one tie game, Eagles vs. the Bengals). Just picking winners, the Vegas line went 9-6 which isn’t very good.


My system is named after the Zoltar fortune teller machine you can find in arcades. Coin-operated fortune telling machines have been around for decades. Here are three very old machines I found on the Internet.

Posted in Zoltar | 2 Comments

Why You Should Not Use Neural Network Label Smoothing

Neural network label smoothing is a technique to prevent model overfitting. I never use label smoothing (LS) because:

1. LS introduces a new hyperparameter, which makes a complex system more complex and makes the results less interpretable.
2. LS modifies data, which is conceptually offensive and problematic in practice.
3. You can achieve a roughly equivalent LS effect by using weight decay or L1/L2 regularization.

I’ll explain label smoothing by using an example. Suppose you create a neural network classifier where there are three possible outcomes, for example, the Iris dataset where the three species to predict are setosa or versicolor or virginica. Your training data might look like:

5.1, 3.5, 1.4, 0.2,  1, 0, 0  # setosa
7.0, 3.2, 4.7, 1.4,  0, 1, 0  # versicolor
6.3, 2.9, 5.6, 1.8,  0, 0, 1  # virginica
. . .

The first four values on each line are predictors and the next three values are one-hot encoded species. An example of label smoothing is to modify the training data to use "soft targets" like so:

5.1, 3.5, 1.4, 0.2,  0.8, 0.1, 0.1  # setosa
7.0, 3.2, 4.7, 1.4,  0.1, 0.8, 0.1  # versicolor
6.3, 2.9, 5.6, 1.8,  0.1, 0.1, 0.8  # virginica
. . .

This label smoothing approach sometimes reduces model overfitting so that when the trained model is presented with new, previously unseen data, the prediction accuracy is better than if you don’t use label smoothing.

Here’s a brief, hand-waving argument of what happens when you use LS training data. First, without LS, imagine you are updating the middle output node and the target value is 1 and the computed output value is 0.75 — you want to increase the weights that are connected to the node so that the computed output will increase and get closer to the target of 1.

Regardless of whether you are using cross entropy error or mean squared error, a weight delta is computed using the calculus derivative of the error function, and that delta always contains the error term (target – output), which is (1 – 0.75) = 0.25. That error will be modified by the learning rate, so if the learning rate is 0.01 the delta will contain 0.25 * 0.01 = 0.0025 and the weight will increase slightly.

Now on the next training iteration, suppose the computed output is 0.97. The error term is (1 – 0.97) = 0.03 and the delta will contain 0.03 * 0.01 = 0.0003 and the weight will increase but only by a tiny amount.

The ultimate effect of this training approach is that weight values could get very large, and large weight values sometimes give an overfitted model.

Now, suppose you’re using label smoothing. If the computed output is 0.75, the error term is (target – output) = (0.8 – 0.75) = 0.05 and the weight delta will contain 0.05 * 0.01 = 0.0005 and the weight will increase, but only by a small amount. Now on the next iteration, if the computed output is 0.97 the error term is (0.8 – 0.97) = -0.17 and the delta will contain -0.17 * 0.01 = -0.0017 and the weight value will decrease slightly.

The ultimate effect of the label smoothing approach is that weight values are usually prevented from getting very large, which can help prevent model overfitting.

Let me emphasize that this hand-waving argument has left out many important details.

OK. First problem with label smoothing: Where did the (0.1, 0.8, 0.1) soft targets come from? Why not (0.15, 0.70, 0.15) or (0.2, 0.6, 0.2) or something else? There’s no good answer to this question. Mathematically, label smoothing is usually presented as:

t’ = (1-a) * t + (a/K)

where t’ is the soft target, t is the original hard target (0 or 1), K is the number of classes, and a is any value between 0.0 and 1.0. For example, if a = 0.10 and K = 3, then a hard target of 1 becomes (1 – 0.10) * 1 + (0.10 / 3) = 0.9333 and the two 0 hard targets become 0.0333 each.
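The formula is easy to code up. A tiny sketch (my illustration) that reproduces the a = 0.10, K = 3 example:

```python
# Label smoothing: t' = (1 - a) * t + a / K for each hard target t.
def smooth(hard_targets, a):
    K = len(hard_targets)
    return [(1.0 - a) * t + (a / K) for t in hard_targets]

soft = smooth([0, 1, 0], 0.10)   # approx (0.0333, 0.9333, 0.0333)
```

Note that the smoothed values still sum to 1.0, which is one reason the technique is usually presented this way.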

But this apparently sophisticated math basis is a hoax because there’s no good way to choose the value of a. In other words, the label smoothing values can be whatever you want. Ugly.

The second problem with label smoothing is that because the effect of LS is to restrict the magnitude of weight values, there are other simpler techniques that do this, such as weight decay, L1 regularization, and L2 regularization. Now, it’s true that these techniques don’t work exactly the same as LS, but the general principle is the same.

Finally, the worst problem with label smoothing in my opinion is that you are changing data. Philosophically this is just ugly, ugly, ugly. It’s true that you don’t have to physically change the training data — instead you can programmatically change the hard target values to label smoothed soft target values during training. But modifying data is almost always just wrong.

Let me wrap up by saying that when I did my research on label smoothing for this blog post, I was horrified by what I found on the Internet. Almost every blog post and short article, and even many formal research papers, had significant errors.

For example, almost all references either imply or explicitly state that there’s a necessary relation between label smoothing and cross entropy error. This is not correct. You can use label smoothing with cross entropy error or mean squared error or any other kind of error. When you use some form of error, the back-propagation technique uses the calculus derivative of the error function, not the error function itself, to compute a weight update delta value. The weight update term for all error functions contains a (target – output) term, and that term is the only place where label smoothing comes into play. For details, see my post at https://jamesmccaffrey.wordpress.com/2019/09/23/neural-network-back-propagation-weight-update-equation-mean-squared-error-vs-cross-entropy-error/.

I also read several Internet label smoothing articles that talked about “confidence” and “calibration” that were complete technical nonsense.

Incidentally, label smoothing has been around since at least the mid 1980s when it wasn’t uncommon to use 0.9 and 0.1 instead of 1 and 0 for binary classification. This is exactly equivalent to label smoothing with K = 2 and a = 0.2. It seems like the technique was forgotten in the late 1990s but then was “rediscovered” in the mid 2010s.

Thank you to my colleague Hyrum A. who pointed out a recent research paper that looked at label smoothing.


“Smooth douglasia” – a relatively rare wildflower that grows in the Pacific Northwest. “Smooth Operator” – a 1984 song by a British group called Sade. “Antelope Smooth Red Rock Canyon” – a beautiful slot canyon in Arizona. “Smooth haired dachshund” – originally bred in the early 1700s to hunt burrow-dwelling animals like badgers and rabbits. This dachshund puppy doesn’t look very threatening to burrow-dwelling animals or anything else.

Posted in Machine Learning | Leave a comment

Saving a PyTorch Model

Bottom line: There are two main ways to save a trained PyTorch neural network model. You should use the newer "state_dict" approach rather than the older "full" approach.

The recommended way to save a PyTorch model looks like:

import torch as T

class Net(T.nn.Module):
  # define neural network here

def main():
  net = Net()  # create
  # train network

  path = ".\\Models\\my_model.pth"
  T.save(net.state_dict(), path)

if __name__ == "__main__":
  main()

Then to use the saved model in another file:

import torch as T

class Net(T.nn.Module):
  # exactly the same as above

def main():
  print("\nLoad using state_dict approach (preferred)")
  path = ".\\Models\\my_model.pth"
  model = Net()
  model.load_state_dict(T.load(path))
  
  # use the model to make predictions

if __name__ == "__main__":
  main()

The older approach looks very similar:

import torch as T

class Net(T.nn.Module):
  # define neural network here

# save old way (not preferred)
path = ".\\Models\\my_model.pth"
T.save(net, path)

# in another file:
class Net(T.nn.Module):
  # exactly the same as above

path = ".\\Models\\my_model.pth"
model = T.load(path)

You have to look at the code very carefully to see the differences between the old way and the newer state_dict approach. Notice that in both techniques, you must have the class definition of the neural network in the file that saves the model, and also in the file that loads the model.

I won’t try to explain why the newer state_dict approach is preferred — it’s really low-level details.

Just for fun, I coded up three complete working PyTorch programs to demonstrate. The first program creates a dummy neural network, computes an example output, and saves the model using both the state_dict way and also the older “full” way. The second program loads the state_dict model and computes an example output. The third program loads the older-format model and computes an example output. All three output values are the same.

In addition to saving a PyTorch model using the two ways I’ve explained here, you can also save a PyTorch model using the ONNX format, which I don’t recommend at this time. I’ll explain ONNX in another blog post sometime. Briefly, ONNX is new and still immature (so ONNX is not fully supported), and you can’t even run a saved ONNX model using PyTorch (you have to use an entirely different system to run the saved model).





Three (fashion) models saved (via photography). The photos were taken by Nina Leen (1910 – 1995) who was a famous photographer and was best known for her contributions to Life Magazine. Life Magazine was one of the most important means of communication in the world, especially from the years 1936 – 1972. These three old photos of models from the 1950s hold up very well today in my opinion.


Posted in PyTorch | Leave a comment

Machine Learning and DNA Kinship Analysis and Criminal Justice

When I was in college at U.C. Irvine, my original program of study was for a dual biology and chemistry degree. I switched to mathematics when I realized I had more passion for matrices than molecules. But I’ve always been interested in biochemistry.

An interesting news article caught my attention recently. DNA kinship analysis was used to solve a crime that took place 36 years ago. On November 22, 1984, a 14-year old girl named Wendy Jerome walked out of the door of her home in Rochester, New York after dinner at 7:00 PM to deliver a birthday card to her best friend who lived a few doors down the street.

Wendy’s body was found a few hours later behind a dumpster. She had been raped and then brutally beaten to death.

DNA matching did not exist in 1984. The first use of DNA matching in a criminal case occurred in 1986. But Rochester police saved Wendy’s clothes. Years later, DNA matching had become a common technique, but the unknown murderer’s DNA on Wendy’s clothes did not match any criminal in the CODIS database.


Left: Wendy Jerome. Right: The murderer, Timothy Williams from a police booking photo (with watermarks).


However, a recently developed technique, DNA kinship analysis, solved the crime in September 2020. Technicians analyzed the unknown murderer’s DNA and generated a list of criminals whose DNA was in CODIS and who were highly likely to be related to the murderer. This list quickly identified a suspect, Timothy Williams. Williams’ DNA was obtained, and it matched the DNA found on Wendy Jerome 36 years before, proving he was responsible for the crime. Williams is age 56 now so he was 20 years old when he raped and murdered Wendy Jerome.

The first DNA matching techniques, which were developed in the 1980s, are based on classical statistics, leading to statements like, “There is only one chance in 100 trillion that the DNA came from someone other than the suspect.” However, deep neural machine learning techniques are now being applied to DNA analysis, including kinship analysis.

Fascinating. Kinship analysis is part of a larger field of study called bioinformatics. I wish I knew more about bioinformatics, especially new techniques that use deep neural technologies. But with the Internet, I’m quite sure I’ll learn as time goes by.

This story illustrates incredible science — the best of humanity — and an evil person that represents the worst. It’s a good thing to bring criminals to justice, but I hope that some day machine learning and AI can be used to prevent crime before it happens.


Posted in Machine Learning | Leave a comment

Can a Neural Network Predict the Area of a Triangle?

While I was walking my dogs one weekend, I remembered an old problem: Can a neural network predict the area of a triangle?

Neural networks are very good at classification, for example predicting the species (setosa, versicolor, or virginica) of an iris flower, based on the flower’s petal length and width, and sepal length and width. And neural networks are quite good at some regression problems, such as predicting the median house price in a town based on the average size of houses in the town, the tax rate in the town, the nearness to the closest major city, and so on.

But neural networks are not really intended for ordinary math computations such as computing the area of a triangle based on base and height. In case your elementary school math is a bit rusty, I’ll remind you that the area of a triangle is 1/2 times the base times the height.

I work at a large tech company and PyTorch is the officially preferred neural network code library, as well as my personally preferred library. I decided to look at predicting the area of a triangle using PyTorch version 1.6, the current version on the weekend when I was walking my dogs.

I wrote a program that programmatically generated 10,000 training examples where the base and height values were random values between 0.1 and 0.9 (and so the areas were between 0.005 and 0.405). I created a 2-(100-100-100-100)-1 neural network — 2 input nodes, four hidden layers with 100 nodes each, and a single output node. I used tanh activation on the hidden nodes, and no activation on the output nodes.

I trained the network using batches of 10 items for 1,000 epochs.

After training, the network correctly predicted 100% of the training items to within 10% of the correct area, 100% of the training items to within 5% of the correct area, and 82% of the training items to within 1% of the correct area. Whether this is a good result or not depends upon your point of view.

Good fun. There’s a lot of buzz around deep learning and there’s a beehive of research activity on the topic. But it’s not magic.


On the same weekend I was thinking about triangles, I watched an old 1967 spy movie called “Deadlier than the Male” featuring female assassins with beehive hair styles. Left: Actress Elke Sommer played the primary assassin. I have no idea how that hair style works. Center and Right: An Internet image search returned quite a few images like these, so I guess the beehive style is still sometimes used today.


# triangle_area_nn.py
# predict area of triangle using PyTorch NN

import numpy as np
import torch as T
device = T.device("cpu")

class TriangleDataset(T.utils.data.Dataset):
  # 0.40000, 0.80000, 0.16000 
  #   [0]      [1]      [2]
  def __init__(self, src_file, num_rows=None):
    all_data = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,3), delimiter=",", skiprows=0,
      dtype=np.float32)

    self.x_data = T.tensor(all_data[:,0:2],
      dtype=T.float32).to(device)
    self.y_data = T.tensor(all_data[:,2],
      dtype=T.float32).to(device)

    self.y_data = self.y_data.reshape(-1,1)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    if T.is_tensor(idx):
      idx = idx.tolist()
    base_ht = self.x_data[idx,:]  # idx rows, both cols
    area = self.y_data[idx,:]     # idx rows, the 1 col
    sample = { 'base_ht' : base_ht, 'area' : area }
    return sample

# ---------------------------------------------------------

def accuracy(model, ds):
  # ds is an iterable Dataset of Tensors
  n_correct10 = 0; n_wrong10 = 0
  n_correct05 = 0; n_wrong05 = 0
  n_correct01 = 0; n_wrong01 = 0

  # alt: create DataLoader and then enumerate it
  for i in range(len(ds)):
    inpts = ds[i]['base_ht']
    tri_area = ds[i]['area']    # float32 correct area
    with T.no_grad():
      oupt = model(inpts)

    delta = abs(tri_area.item() - oupt.item())  # absolute error
    if delta < 0.10 * tri_area.item():
      n_correct10 += 1
    else:
      n_wrong10 += 1

    if delta < 0.05 * tri_area.item():
      n_correct05 += 1
    else:
      n_wrong05 += 1

    if delta < 0.01 * tri_area.item():
      n_correct01 += 1
    else:
      n_wrong01 += 1

  acc10 = (n_correct10 * 1.0) / (n_correct10 + n_wrong10)
  acc05 = (n_correct05 * 1.0) / (n_correct05 + n_wrong05)
  acc01 = (n_correct01 * 1.0) / (n_correct01 + n_wrong01)

  return (acc10, acc05, acc01)

# ----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(2, 100)  # 2-(100-100-100-100)-1
    self.hid2 = T.nn.Linear(100, 100)
    self.hid3 = T.nn.Linear(100, 100)
    self.hid4 = T.nn.Linear(100, 100)
    self.oupt = T.nn.Linear(100, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight)  # glorot
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)  # glorot
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.hid3.weight)  # glorot
    T.nn.init.zeros_(self.hid3.bias)
    T.nn.init.xavier_uniform_(self.hid4.weight)  # glorot
    T.nn.init.zeros_(self.hid4.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)  # glorot
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))  # or T.nn.Tanh()
    z = T.tanh(self.hid2(z))
    z = T.tanh(self.hid3(z))
    z = T.tanh(self.hid4(z))
    z = self.oupt(z)          # no activation
    return z

# ----------------------------------------------------------


def main():
  # 0. make training data file
  np.random.seed(1)
  T.manual_seed(1)
  hi = 0.9; lo = 0.1
  train_f = open("area_train.txt", "w")
  for i in range(10000):
    base = (hi - lo) * np.random.random() + lo
    height = (hi - lo) * np.random.random() + lo
    area = 0.5 * base * height
    s = "%0.5f, %0.5f, %0.5f \n" % (base, height, area)
    train_f.write(s)
  train_f.close()

  # 1. create Dataset and DataLoader objects
  print("Creating Triangle Area train DataLoader ")

  train_file = ".\\area_train.txt"
  train_ds = TriangleDataset(train_file)  # all rows
  bat_size = 10
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  # 2. create neural network
  print("Creating 2-(100-100-100-100)-1 regression NN ")
  net = Net()

  # 3. train network
  print("\nPreparing training")
  net = net.train()  # set training mode
  lrn_rate = 0.01
  loss_func = T.nn.MSELoss()
  optimizer = T.optim.SGD(net.parameters(),
    lr=lrn_rate)
  max_epochs = 1000
  ep_log_interval = 100
  print("Loss function: " + str(loss_func))
  print("Optimizer: SGD")
  print("Learn rate: 0.01")
  print("Batch size: 10")
  print("Max epochs: " + str(max_epochs))

  print("\nStarting training")
  for epoch in range(0, max_epochs):
    epoch_loss = 0.0            # for one full epoch

    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch['base_ht']  # [10,2]  base, height inputs
      Y = batch['area']     # [10,1]  correct area to predict

      optimizer.zero_grad()
      oupt = net(X)            # [10,1]  computed 

      loss_obj = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_obj.item()  # accumulate
      loss_obj.backward()
      optimizer.step()

    if epoch % ep_log_interval == 0:
      print("epoch = %4d   loss = %0.4f" % \
        (epoch, epoch_loss))
  print("Done ")

  # 4. evaluate model
  net = net.eval()
  (acc10, acc05, acc01) = accuracy(net, train_ds)
  print("\nAccuracy (.10) on train data = %0.2f%%" % \
    (acc10 * 100))
  print("\nAccuracy (.05) on train data = %0.2f%%" % \
    (acc05 * 100))
  print("\nAccuracy (.01) on train data = %0.2f%%" % \
    (acc01 * 100))

if __name__ == "__main__":
  main()
Posted in PyTorch | Leave a comment

NFL 2020 Week 3 Predictions – Zoltar Seeks Redemption

Zoltar is my NFL prediction computer program. It uses a deep neural network and reinforcement learning. Here are Zoltar’s predictions for week #3 of the 2020 NFL season:

Zoltar:     jaguars  by    6  dog =    dolphins    Vegas:     jaguars  by    3
Zoltar:    steelers  by    4  dog =      texans    Vegas:    steelers  by  3.5
Zoltar:    patriots  by    6  dog =     raiders    Vegas:    patriots  by  6.5
Zoltar:      eagles  by   10  dog =     bengals    Vegas:      eagles  by  6.5
Zoltar:      browns  by    6  dog =    redskins    Vegas:      browns  by    7
Zoltar:      titans  by    0  dog =     vikings    Vegas:      titans  by  2.5
Zoltar:       bills  by    4  dog =        rams    Vegas:       bills  by    3
Zoltar: fortyniners  by    5  dog =      giants    Vegas: fortyniners  by  4.5
Zoltar:       bears  by    0  dog =     falcons    Vegas:     falcons  by  3.5
Zoltar:       colts  by    5  dog =        jets    Vegas:       colts  by 10.5
Zoltar:    chargers  by    6  dog =    panthers    Vegas:    chargers  by    7
Zoltar:  buccaneers  by    0  dog =     broncos    Vegas:  buccaneers  by    6
Zoltar:   cardinals  by    6  dog =       lions    Vegas:   cardinals  by    6
Zoltar:    seahawks  by    6  dog =     cowboys    Vegas:    seahawks  by    5
Zoltar:     packers  by    0  dog =      saints    Vegas:      saints  by  3.5
Zoltar:      ravens  by    6  dog =      chiefs    Vegas:      ravens  by    3

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. For week #3 Zoltar has five hypothetical suggestions. All of them are highly questionable because during weeks 1-3 Zoltar doesn’t have much data yet and with limited data, Zoltar likes underdogs.

The five teams (four underdogs, one favorite) that Zoltar likes in week #3 are:

1. Zoltar likes the Vegas favorite Eagles over the Bengals.
2. Zoltar likes the Vegas underdog Bears against the Falcons.
3. Zoltar likes the Vegas underdog Jets against the Colts.
4. Zoltar likes the Vegas underdog Broncos against the Buccaneers.
5. Zoltar likes the Vegas underdog Packers against the Saints.
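The 3.0-point rule can be expressed in a few lines of code. This is a sketch of my reconstruction of the rule, not the actual Zoltar program; the game data is transcribed from the table above, and the suggest() function name is my own:

```python
# Reconstruction of the suggestion rule: bet when the Zoltar margin and
# the Vegas margin differ by more than 3.0 points. A margin is signed
# relative to Zoltar's favorite, so games where Zoltar and Vegas pick
# opposite favorites get a sign flip.

# (zoltar_fav, zoltar_margin, vegas_fav, vegas_margin), from the table above
games = [
    ("jaguars",     6.0, "jaguars",     3.0),
    ("eagles",     10.0, "eagles",      6.5),
    ("bears",       0.0, "falcons",     3.5),
    ("colts",       5.0, "colts",      10.5),
    ("buccaneers",  0.0, "buccaneers",  6.0),
    ("ravens",      6.0, "ravens",      3.0),
]

def suggest(zoltar_fav, zoltar_margin, vegas_fav, vegas_margin,
            threshold=3.0):
    # express the Vegas margin on Zoltar's scale: positive means
    # Vegas agrees on the favorite, negative means it disagrees
    v_signed = vegas_margin if vegas_fav == zoltar_fav else -vegas_margin
    return abs(zoltar_margin - v_signed) > threshold

for (zf, zm, vf, vm) in games:
    if suggest(zf, zm, vf, vm):
        print("hypothetical suggestion: %s (Vegas favorite: %s)" % (zf, vf))
```

Running this over the full table flags exactly the five games listed above; for example, the Eagles game qualifies because |10 - 6.5| = 3.5, and the Jaguars game does not because |6 - 3| is exactly 3.0, not more.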

Note: I’ve clearly got some bad data or a bug in Zoltar — there’s no way that Zoltar should favor the winless NY Giants over the excellent SF 49ers team. I’ll have to tear apart my data when I get a chance.
Update: Argh! I messed up my data files completely. I'll need to rerun predictions for weeks 1 – 3.
Another update: I've rerun Zoltar's predictions. The ones posted here were made using correct data.

When you bet on an underdog your bet pays off if the underdog wins by any score, or if the game is a tie, or if the favorite team wins but by less than the Vegas point spread. You lose your bet only if the favorite team wins by more than the Vegas point spread. If the favorite team wins by exactly the point spread, the bet is a push.
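Those payoff rules map directly to code. A minimal sketch, with a function name and example scores of my own choosing:

```python
# Grade a point-spread bet on the underdog. 'spread' is the Vegas line
# (the favorite gives this many points); margin is the favorite's score
# minus the underdog's score, so it's negative when the dog wins outright.

def grade_underdog_bet(fav_score, dog_score, spread):
    margin = fav_score - dog_score
    if margin < spread:    # dog wins, ties, or loses by less than the line
        return "win"
    elif margin == spread: # favorite wins by exactly the line
        return "push"
    else:                  # favorite covers the spread
        return "lose"

print(grade_underdog_bet(20, 17, 6.5))  # favorite wins by 3 < 6.5: win
print(grade_underdog_bet(24, 17, 7.0))  # exactly the spread: push
print(grade_underdog_bet(31, 17, 7.0))  # favorite covers: lose
```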

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
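The 53% figure comes from the odds arithmetic: risking $110 to win $100, you break even when the expected profit is zero, which works out to a win rate of 110 / (110 + 100). A quick check (the helper name is mine):

```python
# Breakeven win rate at -110 odds: you need p * 100 >= (1 - p) * 110,
# i.e. p >= 110 / 210, which is about 0.5238 -- the "53%" figure.

def breakeven_rate(risk=110.0, win=100.0):
    return risk / (risk + win)

print("breakeven accuracy = %0.4f" % breakeven_rate())  # 0.5238
```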

Just for fun, I track how well Zoltar does when simply trying to predict which team will win a game. This isn't useful except for parlay betting.

Zoltar did poorly in week #2. Against the Vegas point spread, Zoltar was only 3-3. Darn.

Just predicting winners, Zoltar was an excellent 14-2. Just picking winners, the Vegas line also went 14-2 which is the best one-week result for Vegas I can ever remember.


Left: My system is named after the Zoltar fortune teller machine you can find in arcades. Center and Right: Fortune teller machines have been around for decades. Here are two old ones I found on the Internet.

Posted in Zoltar

Experimental Fighter Planes During World War II

I’ve always been interested in history from all eras. War is awful but war often leads to fast technological advances. Here are some experimental U.S. fighter planes that were built during World war II (1940 – 1945). None of these planes went into production because they weren’t significantly better than designs already in production. By 1945, all design efforts had been focused on jet aircraft — but that’s another blog post.

I think we’re in the very early stage of deep learning. Perhaps the development of quantum computing will be the jet engine of deep learning.


Shown below are six of the U.S. land-based fighter planes that were already in production. The “P” stands for “pursuit” (fighter) and “XP” stands for “experimental pursuit”.

Top row. Left: Lockheed P-38 Lightning. Center: Bell P-39 Airacobra. Right: Curtiss P-40 Warhawk.

Bottom row. Left: Republic P-47 Thunderbolt. Center: North American P-51 Mustang. Right: Vought F4U Corsair (originally intended for aircraft carrier use, switched to land-based).


1. Curtiss XP-46 (1941) – Intended to be a successor to the existing P-40 plane, but its performance wasn't better than the P-40D model.


2. Grumman XP-50 (1941) – Not ordered for production but the design evolved into the successful F7F Tigercat.


3. Bell XP-52 (1941) – Advanced design that would have featured contra-rotating pusher propellers and swept wings. Canceled because of other higher priority designs, including the P-59 Airacomet jet plane. (The XP-52 is the only plane listed that didn't have at least one prototype built, but it looked too cool to leave out.)


4. Vultee XP-54 (1943) – Did not exceed the performance of existing production aircraft.


5. Curtiss XP-55 (1943) – Its performance did not meet expectations.


6. Northrop XP-56 (1943) – Proved to be an unstable design.


7. Curtiss XP-60 (1941) – Intended to be a successor to the existing P-40. Development not pursued because of other war-time production priorities.


8. Curtiss XP-62 (1943) – Had good performance but development was not pursued because of other, higher priority efforts.


9. McDonnell XP-67 (1944) – Very unusual design but only had performance equivalent to existing aircraft already in production.


10. Republic XP-72 (1944) – Had excellent performance but attention had turned to the first jet-powered aircraft.


11. Fisher XP-75 (1943) – Twin contra-rotating propellers. Performed well but not significantly better than existing P-51 already in production.


12. Bell XP-77 (1944) – Explored the idea of a very small, very lightweight design. Ultimately, large, heavy designs proved to be much better.


13. Vultee XP-81 (1945) – Combined two small jets with a regular engine. Excellent performance but by the time it first flew, it was clear that fully jet-powered planes were the future.


Posted in Miscellaneous

PyTorch Multi-Class Classification Using the MSELoss() Function

When I first learned how to create neural networks, there were no good code libraries available. So I, and everyone else at the time, implemented neural networks from scratch using the basic theory. In particular, for multi-class classification, the technique was to use one-hot encoding on the training data, and softmax() activation on the output nodes, and use mean squared error during training.

For example, if there are 3 classes then a target might be (0, 1, 0) and a computed output might be (0.10, 0.70, 0.20), and the squared error would be (0 – 0.10)^2 + (1 – 0.70)^2 + (0 – 0.20)^2.
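That arithmetic is easy to verify. The check below uses plain Python; T.nn.MSELoss(reduction='sum') applied to the same values as tensors gives the same number (note that reduction='mean' would divide by 3):

```python
# Verify the squared-error arithmetic for the 3-class example above.
target = [0.0, 1.0, 0.0]       # one-hot encoded correct class
computed = [0.10, 0.70, 0.20]  # softmax output pseudo-probabilities

sse = sum((t - c) ** 2 for (t, c) in zip(target, computed))
print(round(sse, 2))  # 0.01 + 0.09 + 0.04 = 0.14
```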

Now fast forward several years to the PyTorch library. Weirdly, I couldn't find any examples of multi-class classification using the traditional approach. Instead, all the examples used ordinal encoding for the training data, no activation on the output nodes, and CrossEntropyLoss() during training. It was quite digitally mysterious to me.
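Behind the scenes, CrossEntropyLoss() applies log_softmax() to the raw output logits and then takes the negative log-likelihood of the ordinal class label. The hand computation below shows the arithmetic in plain Python for a single item; T.nn.CrossEntropyLoss() on the same logits and label gives the same number:

```python
import math

def cross_entropy(logits, label):
    # -log(softmax(logits)[label]), computed as log-sum-exp of the
    # logits (shifted by the max for numerical stability) minus the
    # logit of the true class
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[label]

# logits for one item, true class index 0 (ordinal, not one-hot)
print(round(cross_entropy([2.0, 0.5, 0.1], 0), 4))  # ~0.3168
```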

After many hours of experimentation I figured out what was going on, but explaining it would take a ton of words. Instead, I sat down one day to implement a PyTorch multi-class classifier using the old, traditional approach.

I used the Iris Dataset example. First I created training and test data where the species-to-predict was one-hot encoded. The data looks like:

5.1, 3.5, 1.4, 0.2, 1, 0, 0
5.6, 3.0, 4.5, 1.5, 0, 1, 0
6.5, 3.2, 5.1, 2.0, 0, 0, 1
. . .

Next I coded a 4-7-3 neural network that had softmax() activation on the output nodes. Then I coded training using the MSELoss() function.

Interestingly, even though everything worked, the results weren’t quite as good as the now-normal ordinal encoding, no-activation, CrossEntropyLoss() approach in the sense that training took a bit longer to get good results.

After I finished my experiment, I realized that there's an alternative approach. Instead of creating a file of training data where the labels-to-predict are one-hot encoded, such as (0, 0, 1, 0), I could use a file where the labels are ordinal encoded, such as 2, and then write a Dataset class that reads the ordinal encoded data and converts it to one-hot encoding. When I get some time, I'll try that approach out and post my comments.
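The conversion step inside such a Dataset's __getitem__() would just be a label-to-vector mapping. A minimal sketch (the helper name is mine; in the real Dataset the resulting list would be wrapped in T.tensor() and moved to the device):

```python
def to_one_hot(label, num_classes=3):
    # ordinal class index -> one-hot list, e.g. 2 -> [0.0, 0.0, 1.0]
    return [1.0 if i == int(label) else 0.0 for i in range(num_classes)]

print(to_one_hot(2))  # [0.0, 0.0, 1.0]
print(to_one_hot(0))  # [1.0, 0.0, 0.0]
```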

Well, that was a very satisfying experiment. I’m always pleased when I figure out something new. It’s very much like solving a puzzle.


Three interesting mixed media images related to “digitally mysterious”, at least according to a Google image search. I’m not a big fan of ordinary photography as art, or ordinary digital art, but when digital and photography are combined, sometimes the results can be appealing.


# iris_mse_loss.py
# one-hot + softmax + MSELoss (traditional approach)
# PyTorch 1.6.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10 

import numpy as np
import torch as T
device = T.device("cpu")  # apply to Tensor or Module

# -----------------------------------------------------------

class IrisDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None):
    # 5.0, 3.5, 1.3, 0.3, 1, 0, 0
    # . . .
    self.data = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,7), delimiter=",", skiprows=0,
      dtype=np.float32)

    self.num_rows=num_rows  # not essential

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    if T.is_tensor(idx):
      idx = idx.tolist()
    preds = T.tensor(self.data[idx, 0:4], 
      dtype=T.float32).to(device)
    spcs = T.tensor(self.data[idx, 4:7], 
      dtype=T.float32).to(device)
    sample = { 'predictors' : preds, 'species' : spcs }

    return sample

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 7)  # 4-7-3
    self.oupt = T.nn.Linear(7, 3)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.nn.functional.softmax(self.oupt(z), dim=1) # rows 
    return z

# -----------------------------------------------------------

def accuracy(model, dataset):
  # assumes model.eval()
  dataldr = T.utils.data.DataLoader(dataset, batch_size=1,
    shuffle=False)
  n_correct = 0; n_wrong = 0
  for (_, batch) in enumerate(dataldr):
    X = batch['predictors']
    Y = T.flatten(batch['species'])
    oupt = model(X)  # probabilities (softmax applied in forward)

    comp_idx = T.argmax(oupt)
    targ_idx = T.argmax(Y)

    if comp_idx == targ_idx:
      n_correct += 1
    else:
      n_wrong += 1

  acc = (n_correct * 100.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin Iris with MSELoss demo \n")
  T.manual_seed(1)
  np.random.seed(1)
  
  # 1. create DataLoader objects
  print("Creating Iris train and test DataLoader ")

  train_file = ".\\Data\\iris_train_hot.txt"
  test_file = ".\\Data\\iris_test_hot.txt"

  train_ds = IrisDataset(train_file, num_rows=120)
  test_ds = IrisDataset(test_file)

  bat_size = 10
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)
  test_ldr = T.utils.data.DataLoader(test_ds,
    batch_size=1, shuffle=False)

  # 2. create network
  net = Net().to(device)

  # 3. train model
  max_epochs = 20
  ep_log_interval = 2
  lrn_rate = 0.12

  loss_func = T.nn.MSELoss(reduction='mean')  # assumes softmax
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_size)
  print("loss = " + str(loss_func))
  print("optimizer = SGD")
  print("max_epochs = %3d " % max_epochs)
  print("lrn_rate = %0.3f " % lrn_rate)

  print("\nStarting training")
  net.train()
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch
    num_lines_read = 0

    for (batch_idx, batch) in enumerate(train_ldr):
      # print("  batch = " + str(batch_idx))
      X = batch['predictors']  # [10,4]
      Y = batch['species']
      # num_lines_read += bat_size  # early exit
      optimizer.zero_grad()
      oupt = net(X)

      loss_obj = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_obj.item()  # accumulate
      loss_obj.backward()
      optimizer.step()

    if epoch % ep_log_interval == 0:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))
  print("Done ")

  # 4. evaluate model accuracy
  print("\nComputing model accuracy")
  net.eval()

  acc = accuracy(net, test_ds)  # item-by-item
  print("Accuracy on test data = %0.2f%%" % acc)

  # 5. make a prediction
  np.set_printoptions(precision=4)

  print("\nPredicting species for [6.1, 3.1, 5.1, 1.1]: ")
  unk = np.array([[6.1, 3.1, 5.1, 1.1]], dtype=np.float32)
  unk = T.tensor(unk, dtype=T.float32).to(device) 

  probs = net(unk)
  print(probs)

  # 6. save model
  print("\nSaving trained model ")
  fn = ".\\Models\\iris_model.pth"
  T.save(net.state_dict(), fn)

  print("\nEnd Iris demo")

if __name__ == "__main__":
  main()

Posted in PyTorch