Computing log_softmax() for PyTorch Directly

In a PyTorch multi-class classification problem, the basic architecture is to apply log_softmax() activation on the output nodes, in conjunction with NLLLoss() during training. It’s possible to compute softmax() and then apply log() but it’s slightly more efficient to compute log_softmax() directly.

Computing softmax() looks like:

import torch as T

def softmax(x):
  mx = T.max(x)
  y = T.exp(x - mx)
  return y / T.sum(y)

Finding the max value is just a math trick to avoid arithmetic overflow.

Computing log_softmax() directly looks like:

def log_softmax(x):
  mx = T.max(x)
  lse = T.log(T.sum(T.exp(x - mx)))
  return x - mx - lse 
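
To see why the max shift matters, here is a minimal sketch in plain Python (standard library only, so it is a stand-in for the tensor versions above and not PyTorch's own implementation). The helper names naive_log_softmax and stable_log_softmax are made up for this illustration.

```python
import math

def naive_log_softmax(x):
    # softmax first, then log; no max shift, so exp() can overflow
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(x):
    # same algebra as log_softmax() above: x - mx - log(sum(exp(x - mx)))
    mx = max(x)
    lse = math.log(sum(math.exp(v - mx) for v in x))
    return [v - mx - lse for v in x]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]

print(stable_log_softmax(small))
print(stable_log_softmax(large))   # same values: only differences between inputs matter

try:
    naive_log_softmax(large)
except OverflowError as err:
    print("naive version failed:", err)
```

Both inputs have the same spacing between values, so the shifted version returns the same result for each, while the unshifted version cannot even evaluate exp(1000.0).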

The reason why log_softmax() is applied to the output nodes is rather subtle. If the target class is at index [i], then the negative log likelihood loss is just the negative of the log_softmax() value at [i]. For example, if the log_softmax of a neural output is [-1.2040, -1.3863, -0.7985] (the logs of probabilities 0.30, 0.25, 0.45), and the target class label is [2], then the NLLLoss() is -(-0.7985) = 0.7985. Quite remarkable. One way to think of the log_softmax() plus NLLLoss() pairing is that log_softmax() actually computes the error and NLLLoss() just extracts the error.

If you just have a single set of log-softmax outputs and a single target class label, you could write an NLLLoss() like so:

def my_nll_loss(oupt, target):
  # oupt is a vector of log-softmax values
  result = -oupt[target]
  return result

If you have a batch of output values and a vector of targets, you can use the clever diag() function like so:

def my_nll_loss(oupt, targets):
  # oupt is a matrix of log-softmax values
  out = T.diag(oupt[:,targets])  # one val from each row
  return -T.mean(out)

In the early days of neural networks, you’d compute softmax() on the output nodes and then explicit CrossEntropy() loss. The softmax() plus CrossEntropy() loss approach and the log_softmax() plus NLLLoss() approach give the same results but the log_softmax() plus NLLLoss() approach is more efficient from an engineering perspective.
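
The claim that the two pairings give the same result can be checked with a small sketch. This is plain Python on made-up logits, not the PyTorch functions themselves; the helper names and the sample values are assumptions chosen for illustration.

```python
import math

def log_softmax(row):
    mx = max(row)
    lse = math.log(sum(math.exp(v - mx) for v in row))
    return [v - mx - lse for v in row]

def softmax(row):
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(log_probs, targets):
    # pick the log-probability at each row's target index, negate, average
    picked = [row[t] for row, t in zip(log_probs, targets)]
    return -sum(picked) / len(picked)

def cross_entropy(probs, targets, num_classes):
    # explicit -sum(one_hot * log(p)) per row, then average over the batch
    total = 0.0
    for row, t in zip(probs, targets):
        one_hot = [1.0 if k == t else 0.0 for k in range(num_classes)]
        total += -sum(oh * math.log(p) for oh, p in zip(one_hot, row))
    return total / len(probs)

logits = [[2.0, 1.0, 0.1],    # made-up raw outputs, batch of 2
          [0.5, 2.5, 0.3]]
targets = [0, 1]

loss_a = nll_loss([log_softmax(r) for r in logits], targets)
loss_b = cross_entropy([softmax(r) for r in logits], targets, 3)
print(loss_a, loss_b)   # equal up to rounding error
```

The cross-entropy version multiplies every log-probability by a one-hot value and sums, which is extra work compared with just indexing the target position.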



In the old science fiction movies I enjoy, efficiency was sometimes achieved by reusing special effects snippets. A cool spaceship appeared in four different movies. Left: “Flight to Mars” (1951) was quickly produced in just a few weeks to take advantage of the publicity surrounding the Academy Award winning “Destination Moon” (1950). The spaceship for “Flight to Mars” was reused three times. Center-Left: “World Without End” (1956) is an OK film. Center-Right: “It! The Terror from Beyond Space” (1958) is a landmark film and the direct inspiration for “Alien” (1979). Right: “Queen of Outer Space” (1958) is better than you might guess based on the title.


Posted in PyTorch

Why Neural Network Training Momentum Isn’t Used Very Often

During neural network training, it’s possible to use a momentum factor. Momentum is a technique designed to speed up training. But I hardly ever see momentum used. The main problem with momentum is that it adds another hyperparameter, the momentum factor, and the time spent determining a good value for the momentum factor outweighs the benefit in speed.

There are two types of momentum — plain momentum and Nesterov momentum. Nesterov momentum is a more technically sophisticated version of regular momentum. See https://jamesmccaffrey.wordpress.com/2017/07/24/neural-network-nesterov-momentum/.

As usual, the idea is best explained by a concrete example. In the images below, I use no momentum, regular momentum (factor = 0.95), and Nesterov momentum (factor = 0.95). If you look at the loss values, you can see that the two momentum runs do in fact train faster. But if you look at the accuracy metrics, you can see that the no-momentum version has the best test accuracy. The point is that training speed isn’t the only thing that’s important.



Left: No momentum. Center: Regular momentum, factor = 0.95. Right: Nesterov momentum, factor = 0.95.


The key statements are:

max_epochs = 1000
ep_log_interval = 100
lrn_rate = 0.01
loss_func = T.nn.NLLLoss()  # assumes log_softmax()

# 1. optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
# 2. optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate,
#      momentum=0.95)
# 3. optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate, 
#      nesterov=True, momentum=0.95, dampening=0)
. . .

I’m using stochastic gradient descent (SGD). The learning rate is required and tuning the learning rate is a major challenge. The first version doesn’t use any momentum. The second version uses regular momentum with factor 0.95. The third version uses Nesterov momentum with factor 0.95 (the dampening of 0 is required for Nesterov).
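
To make the three variants concrete, here is a minimal plain-Python sketch of the update rules on a one-weight problem. It follows the update rule as I understand it from the torch.optim.SGD documentation (velocity buffer = momentum * buffer + gradient, with dampening 0); the function sgd_run and the toy objective f(w) = w^2 are assumptions for illustration, not the demo program.

```python
def sgd_run(lr, momentum=0.0, nesterov=False, steps=100, w=1.0):
    # minimize f(w) = w^2, whose gradient is 2w
    buf = 0.0                      # velocity buffer, starts at zero
    history = [w]
    for _ in range(steps):
        grad = 2.0 * w
        if momentum > 0.0:
            buf = momentum * buf + grad          # dampening = 0
            # Nesterov looks one step ahead along the velocity
            step = grad + momentum * buf if nesterov else buf
        else:
            step = grad
        w = w - lr * step
        history.append(w)
    return history

plain = sgd_run(lr=0.01)
regular = sgd_run(lr=0.01, momentum=0.95)
nesterov = sgd_run(lr=0.01, momentum=0.95, nesterov=True)

for name, h in [("none", plain), ("regular", regular), ("nesterov", nesterov)]:
    print("%-9s w after 1 step = %.4f, after 10 steps = %.4f" %
          (name, h[1], h[10]))
```

On this toy problem the momentum runs approach zero faster in the first few steps, which matches the faster drop in loss, but it says nothing about test accuracy.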

So, without going into all the technical details, it’s hard enough to find a good learning rate, and if you add trying to find a good value for the momentum factor, you greatly complicate things.

Neural network training momentum is one of several topics that are great in theory, but just don’t work too well in practice.



When I was a college student, I did well in math classes but poorly in physics classes. I never did quite figure out momentum, inertia, and angular momentum. Angular momentum was often illustrated by a spinning bicycle wheel.

The Raleigh Bicycle Company was founded in 1885 in Nottingham, England. Here are three pieces of old Raleigh advertising that are interesting but somewhat difficult to figure out. Left: Why the jet? Why the ominous sky? Center: Why . . . all of it? Right: What is she doing and what does it have to do with bicycles?


Posted in Machine Learning

NFL 2022 Week 13 Predictions – Zoltar Likes the Raiders to Beat the Chargers

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #13 of the 2022 season.

Zoltar:    patriots  by    1  dog =       bills    Vegas:       bills  by    5
Zoltar:    steelers  by    0  dog =     falcons    Vegas:     falcons  by    1
Zoltar:     packers  by    0  dog =       bears    Vegas:     packers  by    3
Zoltar:       lions  by    4  dog =     jaguars    Vegas:       lions  by    1
Zoltar:      browns  by    0  dog =      texans    Vegas:      browns  by  7.5
Zoltar:     vikings  by    6  dog =        jets    Vegas:     vikings  by    3
Zoltar:  commanders  by    0  dog =      giants    Vegas:  commanders  by  2.5
Zoltar:      titans  by    0  dog =      eagles    Vegas:      eagles  by  5.5
Zoltar:      ravens  by    4  dog =     broncos    Vegas:      ravens  by    8
Zoltar:        rams  by    6  dog =    seahawks    Vegas:    seahawks  by    8
Zoltar: fortyniners  by    5  dog =    dolphins    Vegas: fortyniners  by    3
Zoltar:      chiefs  by    0  dog =     bengals    Vegas:      chiefs  by  2.5
Zoltar:     raiders  by    4  dog =    chargers    Vegas:    chargers  by    2
Zoltar:     cowboys  by    6  dog =       colts    Vegas:     cowboys  by  9.5
Zoltar:  buccaneers  by    6  dog =      saints    Vegas:  buccaneers  by  3.5

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. For this season I’ve been using a threshold of 4 points difference but in some previous seasons I used 3 points.

At the beginning of the season, because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is very strongly biased towards Vegas underdogs. I probably need to fix this. For week #13 Zoltar likes five Vegas underdogs:

1. Zoltar likes Vegas underdog Patriots against the Bills.
2. Zoltar likes Vegas underdog Texans against the Browns.
3. Zoltar likes Vegas underdog Titans against the Eagles.
4. Zoltar likes Vegas underdog Rams against the Seahawks.
5. Zoltar likes Vegas underdog Raiders against the Chargers.
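
The selection rule can be sketched in a few lines of Python. This is only my reading of the rule, not Zoltar's actual code: the function suggest_bets is hypothetical, and the strict "more than 4 points" comparison is an assumption inferred from the Ravens vs. Broncos game, where the difference is exactly 4 and no bet is listed.

```python
# (zoltar_favorite, zoltar_margin, zoltar_dog, vegas_favorite, vegas_margin)
games = [
    ("patriots", 1, "bills", "bills", 5.0),
    ("steelers", 0, "falcons", "falcons", 1.0),
    ("packers", 0, "bears", "packers", 3.0),
    ("lions", 4, "jaguars", "lions", 1.0),
    ("browns", 0, "texans", "browns", 7.5),
    ("vikings", 6, "jets", "vikings", 3.0),
    ("commanders", 0, "giants", "commanders", 2.5),
    ("titans", 0, "eagles", "eagles", 5.5),
    ("ravens", 4, "broncos", "ravens", 8.0),
    ("rams", 6, "seahawks", "seahawks", 8.0),
    ("fortyniners", 5, "dolphins", "fortyniners", 3.0),
    ("chiefs", 0, "bengals", "chiefs", 2.5),
    ("raiders", 4, "chargers", "chargers", 2.0),
    ("cowboys", 6, "colts", "cowboys", 9.5),
    ("buccaneers", 6, "saints", "buccaneers", 3.5),
]

def suggest_bets(games, threshold=4.0):
    picks = []
    for z_fav, z_margin, z_dog, v_fav, v_margin in games:
        # Vegas line expressed from the point of view of Zoltar's favorite
        v_line = v_margin if v_fav == z_fav else -v_margin
        diff = z_margin - v_line
        if diff > threshold:
            picks.append(z_fav)   # Zoltar rates this team higher than Vegas does
        elif diff < -threshold:
            picks.append(z_dog)   # Zoltar rates the other team higher than Vegas does
    return picks

print(suggest_bets(games))
```

With the week #13 lines above, this returns the same five teams as the list: Patriots, Texans, Titans, Rams, Raiders.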

For example, a bet on the underdog Patriots against the Bills will pay off if the Patriots win by any score, or if the favored Bills win but by less than 5 points (i.e., 4 points or less). If a favored team wins by exactly the point spread, the wager is a push. This is why point spreads often have a 0.5 added — called “the hook” — to eliminate pushes.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
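
The 53% figure comes from simple arithmetic, sketched below in Python. The variable names are mine; the only inputs are the $110 stake and $100 payout mentioned above.

```python
risk = 110.0    # amount staked on each bet
win = 100.0     # profit on a winning bet

# break-even: p * win - (1 - p) * risk = 0  ->  p = risk / (risk + win)
break_even = risk / (risk + win)
print("break-even accuracy = %.4f" % break_even)   # 0.5238

def expected_profit(p):
    # average profit per bet at prediction accuracy p
    return p * win - (1.0 - p) * risk

for p in (0.50, 0.53, 0.60):
    print("accuracy %.2f -> expected profit per bet = %+.2f" %
          (p, expected_profit(p)))
```

At exactly 53% the expected profit per $110 bet is only a little over a dollar, which is why a much larger margin is needed in practice.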

In week #12, against the Vegas point spread, Zoltar went 2-3 using 4.0 points as the advice threshold, but went 6-4 using a 3.0 threshold. Weeks 12 to 18 are the most difficult for Zoltar.

For the season, against the spread, Zoltar is 41-22 (~65% accuracy).

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn’t useful except for parlay betting. In week #12, just predicting the winning team, Zoltar went 12-4 which is OK but not great. Vegas was similar with 11-5 at just predicting the winning team.

Zoltar sometimes predicts a 0-point margin of victory. There are six such games in week #13. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.



My system is named after the Zoltar fortune teller machine you can find in arcades. That machine is named after the Zoltar machine from the 1988 movie “Big”. And the 1988 Zoltar was named after the “Zoltan” arcade fortune teller from the 1960s.

I’ve always been fascinated by electro-mechanical arcade devices. Center: The “Mystic Ray” machine (circa 1950) actually wrote out a fortune using a pen. Amazing tech for the time.

Right: The “Zodi” machine (circa 1940) actually typed out a fortune using a pneumatic powered typewriter. Also amazing.


Posted in Zoltar

Mahalanobis Distance Example Using Python

Suppose you have a source dataset of five items where each item is a person’s height, test-score, age:

[64.0, 580.0, 29.0]
[66.0, 570.0, 33.0]
[68.0, 590.0, 37.0]
[69.0, 660.0, 46.0]
[73.0, 600.0, 55.0]

And suppose you want the Mahalanobis distance between [66.0, 570.0, 33.0] and [69.0, 660.0, 46.0]. In an old blog post, I showed how to compute the Mahalanobis distance from scratch, using Excel. Just for fun, I decided to verify my Excel calculations by writing a short Python program that uses NumPy and SciPy. The scipy.spatial.distance module has a mahalanobis() function, but it requires the inverse of the covariance matrix of the source dataset.




The Python program and the Excel-from-scratch calculations give the same result.


My demo program worked as expected:

# mahalanobis_demo.py

import numpy as np
import scipy.spatial

print("\nBegin Mahalanobis distance demo ")

data = np.array([[64.0, 580.0, 29.0],
                 [66.0, 570.0, 33.0],
                 [68.0, 590.0, 37.0],
                 [69.0, 660.0, 46.0],
                 [73.0, 600.0, 55.0]])
print("\nSource dataset: ")
print(data)

cm = np.cov(data, rowvar=False)
print("\nCovariance matrix: ")
print(cm)

np.set_printoptions(precision=4, suppress=True)
icm = np.linalg.inv(cm)
print("\nInverse covar matrix: ")
print(icm)

u = np.array([66.0, 570.0, 33.0])
v = np.array([69.0, 660.0, 46.0])
md = scipy.spatial.distance.mahalanobis(u, v, icm)
print("\nu = ", end=""); print(u)
print("v = ", end = ""); print(v)
print("\nMahalanobis distance(u,v) = %0.4f " % md)  # 2.5536

print("\nEnd demo ")

I set up my data row-by-row (one item per row, one variable per column), so when I computed the covariance matrix I had to use the rowvar=False argument. By default, np.cov() treats each row as a variable and each column as an observation.
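
As a cross-check that does not rely on NumPy or SciPy, here is a from-scratch sketch in plain Python. The helper names covariance, solve and mahalanobis are mine. It assumes the sample covariance (dividing by n - 1, as np.cov() does by default) and solves a linear system instead of forming the inverse explicitly.

```python
import math

data = [[64.0, 580.0, 29.0],
        [66.0, 570.0, 33.0],
        [68.0, 590.0, 37.0],
        [69.0, 660.0, 46.0],
        [73.0, 600.0, 55.0]]

def covariance(rows):
    # sample covariance matrix; each row is one item, each column one variable
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(dim)] for i in range(dim)]

def solve(a, b):
    # Gaussian elimination with partial pivoting; returns x where a x = b
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def mahalanobis(u, v, cov):
    d = [ui - vi for ui, vi in zip(u, v)]
    x = solve(cov, d)     # x = inverse(cov) * d, without forming the inverse
    return math.sqrt(sum(di * xi for di, xi in zip(d, x)))

cov = covariance(data)
u = [66.0, 570.0, 33.0]
v = [69.0, 660.0, 46.0]
md = mahalanobis(u, v, cov)
print("%0.4f" % md)   # 2.5536, the same value as the demo
```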

Good fun.



Forced perspective is an optical technique that tricks a viewer’s sense of distance. The technique is often used in conjunction with a small close model to make it appear large and far away. Here’s an example of forced perspective used in the first Star Wars (1977) movie.


Posted in Machine Learning

Custom Loss Functions for PyTorch

The PyTorch neural network code library has built-in loss functions that can handle most scenarios. Examples include NLLLoss() and CrossEntropyLoss() for multi-class classification, BCELoss() for binary classification, and MSELoss() and L1Loss() for regression. Because PyTorch works at a low level of abstraction, it’s possible to write custom loss functions.

To explore this idea, I pulled out one of my standard regression examples, the Boston Area Housing Dataset. There are 506 data items. Each item represents one of 506 towns near Boston. The dataset has 14 columns:

[0] = crime rate / 100, [1] = pct large lots / 100,
[2] = pct business / 100, [3] = adj to river (-1 = no, +1 = yes),
[4] = pollution / 1, [5] = avg num rooms / 10,
[6] = pct built before 1940 / 100, [7] = distance to boston / 100,
[8] = access to highways / 100, [9] = tax rate / 1000,
[10] = pupil-teacher ratio / 100, [11] = density Blacks / 1000,
[12] = pct low socio-economic / 100,
[13] = median house price / 100_000

I normalized the raw data by dividing each column by a constant so that all values (except for the Boolean [3]) are between 0.0 and 1.0. I split the data into a 400-item training set and a 106-item test set.

The usual goal is to predict the median house price [13] in a town from the other variables. I modified the standard example to predict both poverty [12] and price [13] from the other 12 variables. To test the idea of a custom loss function, I ran three micro-experiments.

First, I created and evaluated a 12-(10-10-10)-2 dual-regression model using the built-in L1Loss() function. Second, I used a from-scratch version of L1 loss to make sure I understood exactly how the PyTorch implementation of L1 loss works. The output of the second experiment was identical to the output of the first experiment, as expected.

Third, I wrote a custom loss function that weights poverty [12] twice as much as price [13]. When I ran the program, the results were similar to the first two (actually, surprisingly, somewhat better) indicating that my custom loss function worked.



Left: Using the built-in L1Loss() function. Center: Using a from-scratch version of L1 loss gives the same results. Right: Using a custom L1 loss function that weights poverty twice as much as price.


The key code for using the built-in L1Loss() function is:

import torch as T
device = T.device('cpu')

loss_func = T.nn.L1Loss()  # mean absolute error
. . .
loss_val = loss_func(oupt, y)

The built-in loss function is actually a PyTorch Module, so calling loss_func(oupt, y) invokes the object’s forward() method.

My from-scratch L1 loss function looks like:

def my_L1Loss(output, target):
  loss = T.mean(T.abs(output - target))
  return loss

. . .
loss_val = my_L1Loss(oupt, y)

And the custom L1 loss function looks like:

def weighted_L1Loss(output, target):
  # weight poverty twice as much as price
  wts = T.tensor([2,1], dtype=T.float32).to(device) # by cols
  weighted_outputs = T.mul(output, wts)
  weighted_targets = T.mul(target, wts)
  loss = T.mean(T.abs(weighted_outputs - weighted_targets))
  return loss

. . .
loss_val = weighted_L1Loss(oupt, y)

The implementation was trickier than I expected and I made quite a few mistakes before I found the right path.
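
The arithmetic behind the weighted version can be checked by hand with a tiny plain-Python sketch. The function l1_loss and the sample values are made up for illustration; it mirrors the logic of weighted_L1Loss() using lists instead of tensors.

```python
def l1_loss(output, target, weights=None):
    # mean absolute error over every element; optional per-column weights
    n_cols = len(output[0])
    if weights is None:
        weights = [1.0] * n_cols
    total, count = 0.0, 0
    for o_row, t_row in zip(output, target):
        for w, o, t in zip(weights, o_row, t_row):
            total += abs(w * o - w * t)
            count += 1
    return total / count

oupt = [[0.20, 0.50], [0.10, 0.40]]   # made-up (poverty, price) predictions
y    = [[0.30, 0.30], [0.10, 0.60]]   # made-up targets

print(l1_loss(oupt, y))               # plain L1, about 0.125
print(l1_loss(oupt, y, [2.0, 1.0]))   # poverty errors count double, about 0.15
```

Note that the mean still divides by the number of elements (4 here), not by the sum of the weights, so the weighted loss is on a slightly different scale than the plain loss.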

Good fun. See https://jamesmccaffrey.wordpress.com/2022/10/21/regression-with-multiple-output-values-using-pytorch/ for the program and link to the training and test data.



Software development often uses built-in modules. Left: The first Lego blocks in 1949 were exact copies of blocks from a U.S. company called Kiddicraft. Lego introduced cylinder anti-studs inside each block in 1958, which created the modern form. Right: When I was a young man, I played with American Bricks from the Elgo company. (It’s not clear if “Lego” took its name from “Elgo” or not). Elgo American Bricks pre-date Lego bricks by ten years.

There are theories that suggest that many engineers were helped on their career paths by playing with construction sets when they were boys. Possibly. But the reverse could be true: boys who are pre-disposed to have an engineering brain are naturally attracted to play with construction sets rather than alternatives such as outdoor sports. Or possibly both correlations are true. There are many research results that show boys and girls have definite preferences for different types of toys. The toy preference results are even true for monkeys. Weird.


Posted in PyTorch

“Researchers Explore Machine Learning Hyperparameter Tuning Using Evolutionary Optimization” on the Pure AI Web Site

I contributed to an article titled “Researchers Explore Machine Learning Hyperparameter Tuning Using Evolutionary Optimization” in the November 2022 edition of the Pure AI web site. See https://pureai.com/articles/2022/11/01/evolutionary-optimization.aspx.

When data scientists create a machine learning prediction model, there are typically about 10 to 20 hyperparameters — variables where the values must be determined by trial and error guided by experience and intuition.

Finding good values for model hyperparameters is called hyperparameter tuning. For simple machine learning models, the most common tuning approach used by data scientists is to manually try different permutations of hyperparameter values.

The article describes three different techniques for machine learning hyperparameter tuning. Briefly, an old technique called evolutionary optimization works well for the new generation of neural systems, including transformer architecture systems, that have billions of weights and billions of hyperparameter combinations.



In pseudo-code:

create population of random solutions
loop max_generation times
  pick two parent solutions
  use crossover to create a child solution
  use mutation to modify child solution
  evaluate the child solution
  replace a weak solution in population with child
end-loop
return best solution found

I’m quoted in the article: “Evolutionary optimization for hyperparameter tuning was used as early as the 1980s when even simple neural prediction models were major challenges because of the limited machine memory and CPU power.”

“The recent focus on huge prediction models for natural language processing based on transformer architecture, where a single training run can take days and cost thousands of dollars, has spurred renewed interest in evolutionary optimization for hyperparameter tuning.”



Mutated animals (usually by radiation) that become intelligent are a staple of science fiction. Researcher Viktor Toth has taught rats how to play Doom II. No mutation needed.


Posted in Machine Learning

NFL 2022 Week 12 Predictions – Zoltar Likes Seven Underdogs

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #12 of the 2022 season.

Zoltar:       bills  by    1  dog =       lions    Vegas:       bills  by   10
Zoltar:     cowboys  by    8  dog =      giants    Vegas:     cowboys  by    9
Zoltar:     vikings  by    1  dog =    patriots    Vegas:     vikings  by  2.5
Zoltar:    panthers  by    1  dog =     broncos    Vegas:     broncos  by  2.5
Zoltar:  buccaneers  by    0  dog =      browns    Vegas:  buccaneers  by  3.5
Zoltar:      ravens  by    0  dog =     jaguars    Vegas:      ravens  by    4
Zoltar:    dolphins  by    6  dog =      texans    Vegas:    dolphins  by 12.5
Zoltar:        jets  by    1  dog =       bears    Vegas:        jets  by  4.5
Zoltar:      titans  by    6  dog =     bengals    Vegas:     bengals  by    2
Zoltar:  commanders  by    2  dog =     falcons    Vegas:  commanders  by    4
Zoltar:   cardinals  by    6  dog =    chargers    Vegas:    chargers  by  4.5
Zoltar:     raiders  by    0  dog =    seahawks    Vegas:    seahawks  by  3.5
Zoltar:      chiefs  by    2  dog =        rams    Vegas:      chiefs  by 14.5
Zoltar: fortyniners  by    4  dog =      saints    Vegas: fortyniners  by    9
Zoltar:     packers  by    0  dog =      eagles    Vegas:      eagles  by    7
Zoltar:       colts  by    2  dog =    steelers    Vegas:       colts  by  2.5

Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. For this season I’ve been using a threshold of 4 points difference but in some previous seasons I used 3 points.

At the beginning of the season, because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is very strongly biased towards Vegas underdogs. I probably need to fix this. For week #12 Zoltar likes seven Vegas underdogs:

1. Zoltar likes Vegas underdog Lions against the Bills.
2. Zoltar likes Vegas underdog Texans against the Dolphins.
3. Zoltar likes Vegas underdog Titans against the Bengals.
4. Zoltar likes Vegas underdog Cardinals against the Chargers.
5. Zoltar likes Vegas underdog Rams against the Chiefs.
6. Zoltar likes Vegas underdog Saints against the 49ers.
7. Zoltar likes Vegas underdog Packers against the Eagles.

For example, a bet on the underdog Lions against the Bills will pay off if the Lions win by any score, or if the favored Bills win but by less than 10 points (i.e., 9 points or less). If a favored team wins by exactly the point spread, the wager is a push. This is why point spreads often have a 0.5 added — called “the hook” — to eliminate pushes.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

In week #11, against the Vegas point spread, Zoltar went 3-2 (using 4.0 points as the advice threshold). Zoltar missed on predicting underdogs Steelers (vs. Bengals) and Cardinals (vs. 49ers). The Steelers lost by 37-30, not quite close enough to the 5.0 point Vegas line. The Cardinals lost badly, 38-10 and didn’t come close to the 8.0 point spread.

For the season, against the spread, Zoltar is 39-19 (~67% accuracy).

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn’t useful except for parlay betting. In week #11, just predicting the winning team, Zoltar went only 8-6, which isn’t very good, just slightly better than a coin flip. Vegas was quite good at 11-3 when just predicting the winning team.

Zoltar sometimes predicts a 0-point margin of victory. There are four such games in week #12. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.



My system is named after the Zoltar fortune teller machine you can find in arcades. Arcade Zoltar uses a crystal ball to make his predictions. Crystal balls were a common plot device in movies of the 1930s.

Left: In “The Black Camel” (1931), detective Charlie Chan solves the murder of an actress who was filming in Honolulu. The maid did it.

Center: In “Charlie Chan at Treasure Island” (1939), Chan solves a murder committed by Dr. Zodiac who was actually magician Fred Rhadini in disguise.

Right: In “Black Magic” (1944), Chan solves a murder committed at a seance where the bullet disappears and no gun is found. The weapon was a disguised cigar case with bullets made of frozen blood.


Posted in Zoltar

Recap of the Fall 2022 MLADS Conference

I gave a technical talk titled “Simple Unsupervised Anomaly Detection Using a PyTorch Transformer Autoencoder” at the Fall 2022 Machine Learning Artificial Intelligence and Data Science (MLADS) conference. The MLADS conference is an internal event at the large tech company I work for, and so the conference wasn’t open to the public. The bottom line is that I learned a lot, enjoyed the event, and made some valuable connections.

The event ran from November 14-17, 2022 in Redmond, Washington.

In my talk, I started by describing unsupervised anomaly detection using a standard neural autoencoder. Next I explained Transformer Architecture (TA) as briefly as possible. Then I showed an example of anomaly detection using TA. I used one of my standard examples where the data represents employees and looks like:

 1   0.24   1 0 0   0.2950   0 0 1
-1   0.39   0 0 1   0.5120   0 1 0
 1   0.63   0 1 0   0.7580   1 0 0
-1   0.36   1 0 0   0.4450   0 1 0
 1   0.27   0 1 0   0.2860   0 0 1
. . .

The fields are sex (male = -1), age (divided by 100), city (one of three), income (divided by 100,000), and job type (one of three). The point is that the technique works with any type of data: Boolean, integer, float/real, categorical.
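
The encoding can be sketched as a small Python function. Everything here is illustrative: the function name encode_employee is hypothetical, female = +1 is inferred from the sample rows, and the raw age and income in the usage line are back-computed from the first sample row.

```python
def encode_employee(sex, age, city_idx, income, job_idx):
    # sex: "M" -> -1, "F" -> +1; city_idx and job_idx are 0, 1 or 2
    def one_hot(idx, n=3):
        return [1 if k == idx else 0 for k in range(n)]
    return ([-1 if sex == "M" else 1, age / 100.0]
            + one_hot(city_idx)
            + [income / 100_000.0]
            + one_hot(job_idx))

# first sample row: 1  0.24  1 0 0  0.2950  0 0 1
print(encode_employee("F", 24, 0, 29_500, 2))
```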

There were about 40 people in the physical audience and about 80 people watching online. The presentation was recorded and historically most people will watch the recorded version of the presentation in the days following the event.

I enjoy interacting with my work colleagues in person. There are a lot of very smart people and I always pick up interesting new ideas. When I present in person, I can pick up cues from attendees’ body language and voice characteristics, which helps me give a better presentation. But the main value I get from presenting at internal work conferences is making connections with people who have interesting problems where deep neural techniques can be useful.

In addition to my technical talk, I sat on a panel discussion titled “Meet a Data Scientist”. The panel had five employees (including me) with wildly varying backgrounds and job types. I posed a question to my fellow panelists: “What skills do you think are requirements for the Data Scientist job role?” I was fully expecting to hear SQL, R, and Python (for use with a library such as scikit or PyTorch) and maybe a few other skills. But to my complete surprise, not one of the other four panelists felt that SQL, R, or Python was an essential skill for a Data Scientist. Hmmm. For the work I do, I can’t imagine a Data Scientist not having at least a moderate knowledge of SQL, R, and Python.



When I give a technical presentation, I have to think carefully about what I wear. If I dress too nicely, I might lose credibility with hard core tech guys. If I dress too casually, I might lose credibility with people who interact with external traditional conservative customers such as banks and medical companies. Here are three images from a search for Buryat traditional clothing. Quite elaborate. I like the fancy hats. Buryats are an indigenous group in Siberia. There are roughly 500,000 Buryats.


Posted in Conferences

Simple Numerical Optimization Using an Evolutionary Algorithm with C#

The goal of a numerical optimization problem is to find a vector of values that minimizes some cost function. The most fundamental example is minimizing the Sphere Function f(x0, x1, .. xn) = x0^2 + x1^2 + .. + xn^2. The optimal solution is X = (0, 0, .. 0), where f(X) = 0.0.

An evolutionary algorithm loosely mimics biological crossover (combining two existing parent solutions to create a new child solution) and mutation (slightly modifying a solution in a random way). In high level pseudo-code:

create population of random solutions
loop many times
  pick two parent solutions
  create a child solution
  mutate child
  evaluate child
  replace weak solution in pop with child
end-loop
return best solution found

An evolutionary approach is a meta-heuristic, meaning there are dozens of ways to implement a specific algorithm. I set out to design and implement the simplest possible example for the Sphere Function with dim = 6.

For parent selection, I keep the population of solutions sorted from smallest error to largest and pick one parent from the top/best half of the population and one parent from the bottom/worst half.

For crossover, I pick a random location and produce a child that’s the left part of parent1 (up to that location) and the right part of parent2.

For mutation, I walk through each value in the child solution and flip a virtual coin. If the coin is heads I do nothing, but if the coin is tails, I add a random value between -0.25 and +0.25.
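
The design choices above can be sketched compactly in Python. This is a loose sketch under stated assumptions, not a port of the demo: it uses Python's random module, so the numbers will not match the C# run, and the re-sort after each replacement is my assumption about how the population is kept in order.

```python
import random

def sphere(x):
    return sum(v * v for v in x)

def evolve(dim=6, pop_size=8, max_gen=1000, seed=0):
    rnd = random.Random(seed)
    pop = [[rnd.uniform(-5.0, 5.0) for _ in range(dim)]
           for _ in range(pop_size)]
    pop.sort(key=sphere)                  # best (smallest error) first
    best = pop[0][:]
    start_err = sphere(best)
    half = pop_size // 2
    for _ in range(max_gen):
        # one parent from the better half, one from the worse half
        p1 = pop[rnd.randrange(0, half)]
        p2 = pop[rnd.randrange(half, pop_size)]
        if rnd.random() < 0.5:
            p1, p2 = p2, p1
        # crossover: left part of p1, right part of p2
        cut = rnd.randrange(1, dim - 1)
        child = p1[:cut] + p2[cut:]
        # mutation: coin flip per value, then nudge by a value in [-0.25, +0.25]
        for j in range(dim):
            if rnd.random() < 0.5:
                child[j] += rnd.uniform(-0.25, 0.25)
        if sphere(child) < sphere(best):
            best = child[:]
        # replace a random member of the worse half, then restore sorted order
        pop[rnd.randrange(half, pop_size)] = child
        pop.sort(key=sphere)
    return best, start_err

best, start_err = evolve()
print("start error = %.4f, final error = %.4f" % (start_err, sphere(best)))
```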

I set up a population with 8 solutions and ran the evolutionary algorithm for 1,000 generations. The algorithm found a very good solution of X = (0.02, 0.05, 0.01, -0.00, 0.15, 0.04) with an error of 0.0297.

Evolutionary algorithms aren’t practical for most numerical optimization problems. The primary example is finding the best set of weights for a deep neural network. For NNs, you can use clever Calculus gradients, called back-propagation. But for problems where the error function doesn’t have gradients, evolutionary optimization can be useful.

Someday, when quantum computing becomes practical, evolutionary optimization might be useful for training huge neural networks that have trillions (or more) of weights.



Left: In “Forbidden Planet” (1956), the crew of the United Planets Cruiser C-57D used a spherical “astrogator” for navigation. An excellent movie. Center: In the original “Star Wars” (1977), the spherical Death Star was a giant weaponized space station. A good movie. Right: In “It Came From Outer Space” (1953), benign aliens in a spherical spaceship crash into the Arizona desert and take the form of humans until they can repair their ship. An excellent movie.


Demo code. Replace “lt”, “gt” with Boolean operator symbols.

namespace EvoOptSimple
// .NET 6.0
{
  internal class EvoOptSimpleProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin evo optimization demo ");
      Console.WriteLine("Goal is Sphere function dim = 6 ");
      Console.WriteLine("Opt sln [0, 0, 0, 0, 0, 0], err = 0 ");

      Solver solver = new Solver(6, 8, seed: 0);
      Console.WriteLine("\nInitial population: ");
      solver.Show();

      Console.WriteLine("\nBegin search ");
      solver.Solve(1000);
      Console.WriteLine("Done ");

      Console.WriteLine("\nFinal population: ");
      solver.Show();

      Console.WriteLine("\nEnd demo ");
      Console.ReadLine();
    } // Main
  } // Program

  // -----------------------------------------------------------

  public class Solver
  {
    public int popSize;
    public int dim;

    public double minGene, maxGene;
    public Random rnd;

    public double[][] pop;
    public double[] errs;

    public double[] bestSoln;
    public double bestErr;

    public Solver(int dim, int popSize, int seed)
    {
      this.minGene = -5.0; this.maxGene = 5.0;
      this.rnd = new Random(seed);
      this.dim = dim;
      this.popSize = popSize;
      this.pop = new double[popSize][];
      for (int i = 0; i "lt" popSize; ++i)
        this.pop[i] = new double[dim];
      this.errs = new double[popSize];
      for (int i = 0; i "lt" popSize; ++i)
      {
        for (int j = 0; j "lt" dim; ++j)
          this.pop[i][j] = (this.maxGene - this.minGene) *
            this.rnd.NextDouble() + this.minGene;
        this.errs[i] = this.ComputeError(this.pop[i]);
      }

      Array.Sort(this.errs, this.pop);  // parallel sort

      this.bestSoln = new double[dim];
      for (int j = 0; j "lt" dim; ++j)
        this.bestSoln[j] = this.pop[0][j];
      this.bestErr = this.errs[0];

    } // ctor()

    public double ComputeError(double[] soln)
    {
      // Sphere
      double result = 0.0;
      for (int j = 0; j "lt" soln.Length; ++j)
        result += soln[j] * soln[j];
      return result;
    }

    public void Show()
    {
      for (int i = 0; i "lt" this.popSize; ++i)
      {
        for (int j = 0; j "lt" this.dim; ++j)
        {
          Console.Write(this.pop[i][j].ToString("F4")
            .PadLeft(9) + " ");
        }
        Console.WriteLine(" | " + this.errs[i].ToString("F4")
          .PadLeft(10));
      }
      Console.WriteLine("-----");
      for (int j = 0; j "lt" this.dim; ++j)
        Console.Write(this.bestSoln[j].ToString("F4").
          PadLeft(9) + " ");
      Console.WriteLine(" | " + this.bestErr.ToString("F4")
        .PadLeft(10));
    } // Show

    public int[] PickParents()
    {
      int first = rnd.Next(0, this.popSize / 2);  // top half
      int second = rnd.Next(this.popSize / 2, this.popSize);
      int flip = rnd.Next(0, 2);  // 0 or 1
      if (flip == 0)
        return new int[] { first, second };
      else
        return new int[] { second, first };
    }

    public double[] CrossOver(double[] parent1,
      double[] parent2)
    {
      int idx = this.rnd.Next(1, this.dim-1); 
      double[] child = new double[this.dim];
      for (int k = 0; k < idx; ++k)
        child[k] = parent1[k];
      for (int k = idx; k < this.dim; ++k)
        child[k] = parent2[k];
      return child;
    }

    public void Mutate(double[] soln)
    {
      double lo = -0.25;
      double hi = 0.25;
      for (int j = 0; j < soln.Length; ++j)
      {
        int flip = this.rnd.Next(0, 2);  // 0 or 1
        if (flip == 1)
        {
          double delta = (hi - lo) * 
            this.rnd.NextDouble() + lo;
          soln[j] += delta;
        }
      }
    } // Mutate

    public void Solve(int maxGen)
    {
      for (int gen = 0; gen < maxGen; ++gen)
      {
        // 1. make a child
        int[] parentIdxs = this.PickParents();
        double[] parent1 = this.pop[parentIdxs[0]];
        double[] parent2 = this.pop[parentIdxs[1]];
        double[] child = this.CrossOver(parent1, parent2);

        // 2. mutate and evaluate
        this.Mutate(child);
        double childErr = this.ComputeError(child);

        // 2b. new best?
        if (childErr < this.bestErr)
        {
          if (gen < 20 || gen > 700)
            Console.WriteLine("New best soln found at gen " +
              gen);
          for (int i = 0; i < child.Length; ++i)
            this.bestSoln[i] = child[i];
          this.bestErr = childErr;
        }

        // 3. replace a weak soln with child
        int idx = this.rnd.Next(this.popSize / 2, this.popSize);
        for (int j = 0; j < this.dim; ++j)
          this.pop[idx][j] = child[j];
        this.errs[idx] = childErr;

        // 4. create immigrant
        double[] imm = new double[this.dim];
        for (int j = 0; j < this.dim; ++j)
          imm[j] = (this.maxGene - this.minGene) * 
            this.rnd.NextDouble() + this.minGene;
        double immErr = this.ComputeError(imm);

        // 4b. new best?
        if (immErr < this.bestErr)
        {
          if (gen < 20 || gen > 700)
            Console.WriteLine("New best soln (imm) at gen " +
              gen);
          // copy the immigrant (not the child) as the new best
          for (int i = 0; i < imm.Length; ++i)
            this.bestSoln[i] = imm[i];
          this.bestErr = immErr;
        }

        idx = this.rnd.Next(this.popSize / 2, this.popSize);
        this.pop[idx] = imm;
        this.errs[idx] = immErr;

        // 5. sort
        Array.Sort(this.errs, this.pop);

        if (gen == 500) Console.WriteLine(". . . ");
      } // each gen
    } // Solve()
  } // Solver
} // ns

Multi-Class Classification Using PyTorch 1.12.1-CPU on MacOS

I do most of my work on Windows OS machines. One morning I noticed that my MacBook laptop in my office was collecting dust so I figured I’d upgrade the existing PyTorch 1.10.0 to version 1.12.1 to make sure there were no breaking changes, and also to refresh my memory of working with MacOS. Switching between Windows and MacOS is easier for me if I stay in practice.

Windows          Mac
--------------------------------
Notepad          TextEdit
cmd              Terminal (bash)
Ctrl-c           Command-c
PrtScn key       Shift-Command-3
File Explorer    Finder
Chrome           Safari

I fired up my MacBook, opened a Terminal (bash) shell, and checked my existing Python 3.7.6 installation, which was fine. Next I went to download.pytorch.org/whl/torch_stable.html and clicked on the link to the cpu/torch-1.12.1-cp37-none-macosx_10_9_x86_64.whl file, which downloaded it. In the shell I uninstalled my existing PyTorch 1.10.1 with the command “pip uninstall torch”. Then I navigated to the Downloads directory and installed using the command “pip install torch-1.12.1-cp37-none-macosx_10_9_x86_64.whl”. Installation worked without any problems. Amazing.
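
One quick way to confirm which PyTorch version ended up installed is to ask Python directly. The sketch below is only an illustration: the helper name installed_version is made up, and it returns None when a package can't be imported, so it runs even on a machine without PyTorch.

```python
import importlib

def installed_version(pkg_name):
  # hypothetical helper: returns the package's __version__ string,
  # or None if the package can't be imported or has no __version__
  try:
    mod = importlib.import_module(pkg_name)
  except ImportError:
    return None
  return getattr(mod, "__version__", None)

print("torch version:", installed_version("torch"))
```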

To test the PyTorch installation, I did one of my standard multi-class classification demos. The goal is to predict a person’s political type (conservative = 0, moderate = 1, liberal = 2) from sex, age, state (Michigan, Nebraska, Oklahoma), and income. See the data and program for the Windows version at https://jamesmccaffrey.wordpress.com/2022/09/01/multi-class-classification-using-pytorch-1-12-1-on-windows-10-11/.

I copied the training and test data from the page linked above and saved them as people_train.txt and people_test.txt. The data looks like:

 1,0.24,1,0,0,0.2950,2
-1,0.39,0,0,1,0.5120,1
 1,0.63,0,1,0,0.7580,0
-1,0.36,1,0,0,0.4450,1
. . .
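
To make the file layout concrete, here is a minimal sketch of parsing lines in this format with plain Python. It assumes the first six comma-separated values are the predictors (sex, age, three state columns, income) and the last value is the class label, which is consistent with a 6-input, 3-output network. The function name parse_line is made up for illustration, and the actual demo may well load the data differently.

```python
def parse_line(line):
  # hypothetical helper: split one comma-delimited data line into
  # (predictors, label); assumes 6 predictors then the class label
  vals = [v.strip() for v in line.split(",")]
  x = [float(v) for v in vals[0:6]]
  y = int(vals[6])
  return x, y

lines = [
  " 1,0.24,1,0,0,0.2950,2",
  "-1,0.39,0,0,1,0.5120,1",
]
for ln in lines:
  x, y = parse_line(ln)
  print(x, y)
```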

The network definition looks like:

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(6, 10)  # 6-(10-10)-3
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 3)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss() 
    return z
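
Because forward() returns log-softmax values rather than probabilities, interpreting an output takes one extra step: exp() turns the values back into pseudo-probabilities, and the index of the largest value is the predicted class. The sketch below shows the idea in plain Python so it runs without PyTorch. The raw output values are made up, and the class names follow the 0/1/2 encoding described above.

```python
import math

def log_softmax(vals):
  # subtract the max before exp() to avoid arithmetic overflow
  mx = max(vals)
  lse = math.log(sum(math.exp(v - mx) for v in vals))
  return [v - mx - lse for v in vals]

raw = [1.5, 0.3, 2.2]  # made-up pre-activation output values
log_probs = log_softmax(raw)
probs = [math.exp(lp) for lp in log_probs]  # pseudo-probabilities
pred = log_probs.index(max(log_probs))  # index of largest value
labels = ["conservative", "moderate", "liberal"]
print(labels[pred], [round(p, 4) for p in probs])
```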
 

Anyway, after saving the data and PyTorch program, I ran the program and . . . it almost worked the first time. I forgot to change the Windows “\\” file path separators to the Unix-style “/” separators that MacOS uses. I made the changes and then the program worked. Minor miracle.
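
One way to sidestep the separator problem is to build file paths with Python's standard pathlib module instead of hard-coding “\\” or “/”. This is just a sketch: the Data directory name is an assumption for illustration, and only the two file names come from the text above.

```python
from pathlib import Path

# "Data" is a hypothetical directory name; use your own location
data_dir = Path(".") / "Data"
train_file = data_dir / "people_train.txt"
test_file = data_dir / "people_test.txt"

# as_posix() always shows forward slashes; str() uses the
# separator of whatever OS the script is running on
print(train_file.as_posix())
print(str(test_file))
```

The resulting Path objects can be passed straight to open(), so the same script should run unchanged on Windows and MacOS.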



Predicting a horse race is either a multi-class classification problem or a ranking problem, depending on your point of view. Three fantastic old electric horse race games.

Left: Merit Electric Derby (UK) from the 1960s. A battery powered motor flicks one ball bearing in each track which knocks the horse up the incline. The process is random due to the physics involved. Strangely wonderful.

Center: Peers Hardy Horse Racing Derby (UK) from the 1990s. Each horse has an electric motor under the field, which attaches via magnets. Battery powered. The game is quite sophisticated for the time. An electronic board plays music and shows the final win-place-show results. Notice the clockwise direction — common in Europe but non-existent in the US.

Right: Tudor Electric Horse Race Game (US) from the 1960s. Made by the company best known for Electric Football. An electric motor vibrates the field which causes the horses to move forward. Some paths are shorter than others so the result is randomized.

