I give fairly detailed examples of the two approaches at https://jamesmccaffrey.wordpress.com/2020/06/11/pytorch-crossentropyloss-vs-nllloss-cross-entropy-loss-vs-negative-log-likelihood-loss/.

```
# log_soft_demo.py
# Python 3.7.6 (Anaconda3-2020.02)
# PyTorch 1.6.0  Windows 10

import torch as T
device = T.device("cpu")

print("\nBegin softmax and log_softmax() demo \n")

t1 = T.tensor([1.0, 3.0, 2.0], dtype=T.float32).to(device)
sm = T.nn.functional.softmax(t1, dim=0)
lsm = T.nn.functional.log_softmax(t1, dim=0)
l_sm = T.log(T.nn.functional.softmax(t1, dim=0))

T.set_printoptions(precision=4)
print("tensor t1 = ", end=""); print(t1)
print("softmax(t1) = ", end=""); print(sm)
print("log_softmax(t1) = ", end=""); print(lsm)
print("log(softmax(t1)) = ", end=""); print(l_sm)

print("\nEnd demo ")
```

*I computed softmax() and log_softmax() and log(softmax) of [1.0, 3.0, 2.0] using Excel, and then again using PyTorch.*

Now on the one hand, this is all the information that is needed to implement a PyTorch multi-class classifier. But behind the scenes there are many details. These details can be confusing if you have a semi-theoretical knowledge of neural networks; meaning, what about softmax() activation on the output nodes? Briefly, in theory you want to apply softmax() to the raw output node values (called “logits”) so that the sum of the output node values is 1.0 and the values can be loosely interpreted as probabilities. Then you compare the pseudo-probabilities with the target output values. For example, a target output might be (0, 0, 1, 0) and the softmax computed output might be (0.1, 0.2, 0.6, 0.1). The differences between computed outputs and target outputs are then used to adjust the network weights so that the computed output values get better.

But PyTorch examples usually don’t use this approach. It turns out that computing softmax() is astonishingly difficult if you want to avoid arithmetic underflow or overflow. (Believe me, I’ve tried.) So, for the sake of engineering, PyTorch uses log_softmax(), which significantly reduces the likelihood of arithmetic overflow (but unfortunately is still susceptible to underflow).

Somewhat unfortunately, the name of the PyTorch CrossEntropyLoss() is misleading. In mathematics, a cross entropy loss function expects input values that sum to 1.0 (i.e., values that have been softmax()’ed), but the PyTorch CrossEntropyLoss() function expects raw logits and applies log_softmax() to them internally; it is the companion NLLLoss() function that expects inputs that have already had log_softmax() applied.

Put another way: computing softmax is error-prone. Computing log_softmax is less error-prone. Therefore PyTorch usually uses log_softmax, but this means you need the special NLLLoss() function rather than a purely mathematical cross entropy function. To reduce this confusion, PyTorch combines the two steps, log_softmax() plus NLLLoss(), into no output activation plus CrossEntropyLoss(), which turns out to be even more confusing for beginners.
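The equivalence between no-activation-plus-CrossEntropyLoss() and log_softmax()-plus-NLLLoss() can be checked with a few lines of plain Python. This is just a sketch of the underlying arithmetic, not PyTorch’s actual implementation:

```python
import math

def log_softmax(logits):
    # subtract the max for numerical stability, as PyTorch does internally
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def nll_loss(log_probs, target):
    # NLLLoss-style: expects log-probabilities, picks out the target class
    return -log_probs[target]

def cross_entropy_loss(logits, target):
    # CrossEntropyLoss-style: expects raw logits, applies log_softmax internally
    return nll_loss(log_softmax(logits), target)

logits = [1.0, 3.0, 2.0]  # the raw output node values from the demo above
print(cross_entropy_loss(logits, 1))     # loss for target class 1
print(nll_loss(log_softmax(logits), 1))  # identical value
```

Either pipeline produces exactly the same loss value, which is the point of the combined CrossEntropyLoss() design.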

Details, details, details. But interesting, interesting, interesting.

*An artificial neural network is a crude approximation of biological neurons. Both real neurons and artificial neurons have a lot of interesting detail. If you’ve ever looked at a bird feather closely, you’ll have noticed the incredible amount of tiny details it has. Left: Real feather earrings on actress Tia Carrere. Center: Real feather earrings on actress Patricia Velasquez. Right: Artificial feather earrings on actress Angelina Jolie. Both the real and the artificial feathers are very interesting to me because of the detail.*

For a multi-class classification problem, you create a neural network that has the same number of output nodes as there are classes to predict. For example, if you are trying to predict a person’s political leaning of (conservative, moderate, liberal) based on things like age and income, you’d design a neural network with 3 output nodes. The target values are type int64 class indices, such as 1 for moderate, rather than one-hot vectors such as (0, 1, 0), because that is what CrossEntropyLoss() expects. The output layer uses no activation because for training you use CrossEntropyLoss(), which applies log_softmax() automatically. The computed output is three values such as (2.345, -1.987, 4.5678) and the predicted class is the index of the largest output value, [2] in this case.

For a binary classification problem, you create a neural network that has one output node. The output is type float32. The output layer uses logistic-sigmoid activation so the computed output is between 0 and 1. For training you use BCELoss() (binary cross entropy loss), which requires the computed output to be between 0 and 1 and which does not apply sigmoid automatically. The computed output is a single value such as 0.345; if the computed output is less than 0.5 the predicted class is 0, and if the computed output is greater than 0.5 the predicted class is 1.
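The binary pipeline arithmetic can also be sketched in plain Python. The raw output value z here is hypothetical; the point is only to show the sigmoid-then-binary-cross-entropy sequence, not PyTorch’s actual implementation:

```python
import math

def sigmoid(z):
    # logistic-sigmoid squashes a raw output value into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    # binary cross entropy: p is the computed output in (0, 1), y is 0 or 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

z = -0.64                   # hypothetical raw output node value
p = sigmoid(z)              # computed output, between 0 and 1
loss = bce(p, 1)            # error versus target class 1
pred = 0 if p < 0.5 else 1  # threshold at 0.5 to get the predicted class
print(p, loss, pred)
```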

I was pretty sure I could create a binary classifier using the multi-class approach. I created a network with two output nodes and no output activation. For training I used CrossEntropyLoss(), and so log_softmax() is automatically applied during training.

Interestingly, for the dataset I experimented with (the Banknote authentication dataset) I got essentially identical results using the normal binary classification technique and using the modified multi-class classification approach.

Good experiment.

*There are many binary pairs. Good vs. evil. Virtue vs. sin. “Dr. Yen Sin” was an early pulp science fiction magazine. It ran for only three issues in 1936. Left: ‘The Mystery of the Dragon’s Shadow” was the featured story in Issue #1. Center: “The Mystery of the Golden Skull” was featured in Issue #2. Right: “The Mystery of the Singing Mummies” headlined the final Issue #3. It seems odd to me to base a magazine on a villain rather than a hero, but a good, evil villain is usually more interesting than a hero.*

```
Zoltar: jets by 3 dog = broncos Vegas: broncos by 2.5
Zoltar: ravens by 9 dog = redskins Vegas: ravens by 13.5
Zoltar: texans by 4 dog = vikings Vegas: texans by 4
Zoltar: seahawks by 4 dog = dolphins Vegas: seahawks by 7
Zoltar: bears by 6 dog = colts Vegas: colts by 2.5
Zoltar: titans by 5 dog = steelers Vegas: titans by 1.5
Zoltar: jaguars by 0 dog = bengals Vegas: bengals by 3
Zoltar: buccaneers by 6 dog = chargers Vegas: buccaneers by 7
Zoltar: saints by 5 dog = lions Vegas: saints by 5.5
Zoltar: cowboys by 5 dog = browns Vegas: cowboys by 5
Zoltar: cardinals by 0 dog = panthers Vegas: cardinals by 4
Zoltar: rams by 10 dog = giants Vegas: rams by 11.5
Zoltar: chiefs by 6 dog = patriots Vegas: chiefs by 7
Zoltar: bills by 0 dog = raiders Vegas: bills by 2.5
Zoltar: fortyniners by 8 dog = eagles Vegas: fortyniners by 6
Zoltar: packers by 11 dog = falcons Vegas: packers by 6
```

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. For week #4 Zoltar has six hypothetical suggestions.

The teams that Zoltar likes in week #4 are:

1. Underdog NY Jets against the Broncos

2. Underdog Redskins against the Ravens

3. Underdog Bears against the Colts

4. Favorite Titans over the Steelers

5. Underdog Panthers against the Cardinals

6. Favorite Packers over the Falcons

*Note: From my human perspective, these predictions look terrible. The Jets are a very bad team. The Redskins are a very bad team and the Ravens are a good team. The Bears have been lucky so far. The Titans have been lucky so far. The Panthers have some key injuries. The Falcons have been very unlucky so far. I would never bet my own real money on these suggestions, except maybe the Packers. But we’ll see. Zoltar is dispassionate and doesn’t fully understand “lucky” (except to the extent that he takes blowout wins into account).*

When you bet on an underdog, your bet pays off if the underdog wins by any score, or if the game is a tie, or if the favorite team wins but by less than the Vegas point spread. If the favorite team wins by exactly the point spread, the bet is a push. You lose your bet only if the favorite wins by more than the Vegas point spread.
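The payoff rule can be encoded in a small function. The function name and signature are my own, just for illustration:

```python
def underdog_bet_result(fav_margin, spread):
    # fav_margin: favorite's winning margin (negative if the underdog
    #   wins outright, 0 for a tie)
    # spread: the Vegas point spread on the favorite
    if fav_margin < spread:
        return "win"   # underdog won, tied, or lost by less than the spread
    elif fav_margin == spread:
        return "push"  # favorite won by exactly the spread
    else:
        return "lose"  # favorite covered the spread

print(underdog_bet_result(-3, 4.5))  # underdog won outright
print(underdog_bet_result(2, 4.5))   # favorite won by less than the spread
print(underdog_bet_result(7, 4.5))   # favorite covered
print(underdog_bet_result(3, 3.0))   # exactly the spread
```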

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn’t useful except for parlay betting.

Zoltar was weak in week #3. Against the Vegas point spread, Zoltar was 2-3. For the season, Zoltar is 12-8 (60.0%) against the spread. Just predicting winners, Zoltar was 11-4 which is pretty good. (There was one tie game, Eagles vs. the Bengals). Just picking winners, the Vegas line went 9-6 which isn’t very good.

*My system is named after the Zoltar fortune teller machine you can find in arcades. Coin-operated fortune telling machines have been around for decades. Here are three very old machines I found on the Internet.*

1. LS introduces a new hyperparameter, which makes a complex system more complex and makes results less interpretable.

2. LS modifies data, which is conceptually offensive and problematic in practice.

3. You can achieve a roughly equivalent LS effect by using weight decay or L1/L2 regularization.

I’ll explain label smoothing by using an example. Suppose you create a neural network classifier where there are three possible outcomes, for example, the Iris dataset where the three species to predict are setosa or versicolor or virginica. Your training data might look like:

```
5.1, 3.5, 1.4, 0.2, 1, 0, 0  # setosa
7.0, 3.2, 4.7, 1.4, 0, 1, 0  # versicolor
6.3, 2.9, 5.6, 1.8, 0, 0, 1  # virginica
. . .
```

The first four values on each line are predictors and the next three values are the one-hot encoded species. An example of label smoothing is to modify the training data to use “soft targets” like so:

```
5.1, 3.5, 1.4, 0.2, 0.8, 0.1, 0.1  # setosa
7.0, 3.2, 4.7, 1.4, 0.1, 0.8, 0.1  # versicolor
6.3, 2.9, 5.6, 1.8, 0.1, 0.1, 0.8  # virginica
. . .
```

This label smoothing approach sometimes reduces model overfitting so that when the trained model is presented with new, previously unseen data, the prediction accuracy is better than if you don’t use label smoothing.

Here’s a brief, hand-waving argument of what happens when you use LS training data. First, without LS, imagine you are updating the middle output node and the target value is 1 and the computed output value is 0.75 — you want to increase the weights that are connected to the node so that the computed output will increase and get closer to the target of 1.

Regardless of whether you are using cross entropy error or mean squared error, a weight delta is computed using the calculus derivative of the error function, and that delta always contains the error term (target – output), which is (1 – 0.75) = 0.25. That error will be modified by the learning rate, so if the learning rate is 0.01 the delta will contain 0.25 * 0.01 = 0.0025 and the weight will increase slightly.

Now on the next training iteration, suppose the computed output is 0.97. The error term is (1 – 0.97) = 0.03 and the delta will contain 0.03 * 0.01 = 0.0003 and the weight will increase but only by a tiny amount.

The ultimate effect of this training approach is that weight values could get very large, and large weight values sometimes give an overfitted model.

Now, suppose you’re using label smoothing. If the computed output is 0.75, the error term is (target – output) = (0.8 – 0.75) = 0.05 and the weight delta will contain 0.05 * 0.01 = 0.0005 and the weight will increase, but only by a small amount. Now on the next iteration, if the computed output is 0.97 the error term is (0.8 – 0.97) = -0.17 and the delta will contain -0.17 * 0.01 = -0.0017 and the weight value will decrease slightly.

The ultimate effect of the label smoothing approach is that weight values are usually prevented from getting very large, which can help prevent model overfitting.

Let me emphasize that this hand-waving argument has left out many important details.

OK. First problem with label smoothing: Where did the (0.1, 0.8, 0.1) soft targets come from? Why not (0.15, 0.70, 0.15) or (0.2, 0.6, 0.2) or something else? There’s no good answer to this question. Mathematically, label smoothing is usually presented as:

t’ = (1-a) * t + (a/K)

where t’ is the soft target, t is the original hard target (0 or 1), K is the number of classes, and a is any value between 0.0 and 1.0. For example, if a = 0.10 and K = 3, then a hard target of 1 becomes (1 – 0.10) * 1 + (0.10 / 3) = 0.9333 and the two 0 hard targets become 0.0333 each.
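A minimal sketch of that formula in Python (the helper name is my own, not a library function):

```python
def smooth(hard_targets, a):
    # t' = (1 - a) * t + (a / K), applied to each hard target value
    K = len(hard_targets)
    return [(1 - a) * t + (a / K) for t in hard_targets]

print(smooth([0, 1, 0], 0.10))  # approx [0.0333, 0.9333, 0.0333]
print(smooth([0, 1], 0.20))     # approx [0.10, 0.90]
```

Note that the smoothed targets still sum to 1.0, since (1 - a) * 1 + K * (a / K) = 1, and the K = 2, a = 0.2 case reproduces the old 0.9 / 0.1 targets sometimes used for binary classification.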

But this apparently sophisticated math basis is a hoax because there’s no good way to choose the value of a. In other words, the label smoothing values can be whatever you want. Ugly.

The second problem with label smoothing is that because the effect of LS is to restrict the magnitude of weight values, there are other simpler techniques that do this, such as weight decay, L1 regularization, and L2 regularization. Now, it’s true that these techniques don’t work exactly the same as LS, but the general principle is the same.

Finally, the worst problem with label smoothing in my opinion is that you are changing data. Philosophically this is just ugly, ugly, ugly. It’s true that you don’t have to physically change the training data — instead you can programmatically change the hard target values to label smoothed soft target values during training. But modifying data is almost always just wrong.

Let me wrap up by saying that when I did my research on label smoothing for this blog post, I was horrified by what I found on the Internet. Almost every blog post and short article, and even many formal research papers, had significant errors.

For example, almost all references either imply or explicitly state that there’s a necessary relation between label smoothing and cross entropy error. This is not correct. You can use label smoothing with cross entropy error or mean squared error or any other kind of error. When you use some form of error, the back-propagation technique uses the calculus derivative of the error function, not the error function itself, to compute a weight update delta value. The weight update term for all error functions contains a (target – output) term, and that term is the only place where label smoothing comes into play. For details, see my post at https://jamesmccaffrey.wordpress.com/2019/09/23/neural-network-back-propagation-weight-update-equation-mean-squared-error-vs-cross-entropy-error/.

I also read several Internet label smoothing articles that talked about “confidence” and “calibration” that were complete technical nonsense.

Incidentally, label smoothing has been around since at least the mid 1980s when it wasn’t uncommon to use 0.9 and 0.1 instead of 1 and 0 for binary classification. This is exactly equivalent to label smoothing with K = 2 and a = 0.2. It seems like the technique was forgotten in the late 1990s but then was “rediscovered” in the mid 2010s.

*Thank you to my colleague Hyrum A. who pointed out a recent research paper that looked at label smoothing.*

*“Smooth douglasia” – a relatively rare wildflower that grows in the Pacific Northwest. “Smooth Operator” – a 1984 song by a British group called Sade. “Antelope Smooth Red Rock Canyon” – a beautiful slot canyon in Arizona. “Smooth haired dachshund” – originally bred in the early 1700s to hunt burrow-dwelling animals like badgers and rabbits. This dachshund puppy doesn’t look very threatening to burrow-dwelling animals or anything else.*

The recommended way to save a PyTorch model looks like:

```
import torch as T

class Net(T.nn.Module):
  ...  # define neural network here

def main():
  net = Net()  # create
  # train network
  path = ".\\Models\\my_model.pth"
  T.save(net.state_dict(), path)

if __name__ == "__main__":
  main()
```

Then to use the saved model in another file:

```
import torch as T

class Net(T.nn.Module):
  ...  # exactly the same as above

def main():
  print("\nLoad using state_dict approach (preferred)")
  path = ".\\Models\\my_model.pth"
  model = Net()
  model.load_state_dict(T.load(path))
  # use the model to make predictions

if __name__ == "__main__":
  main()
```

The older approach looks very similar:

```
import torch as T

class Net(T.nn.Module):
  ...  # define neural network here

# create and train net, then save the old way (not preferred)
path = ".\\Models\\my_model.pth"
T.save(net, path)

# in another file:
class Net(T.nn.Module):
  ...  # exactly the same as above

path = ".\\Models\\my_model.pth"
model = T.load(path)
```

You have to look at the code very carefully to see the differences between the old way and the newer state_dict approach. Notice that in both techniques, you must have the class definition of the neural network in the file that saves the model, and also in the file that loads the model.
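For the curious, a state_dict is essentially an ordered dictionary that maps parameter names to tensors. A quick way to inspect one, using a small hypothetical network:

```python
import torch as T

class Net(T.nn.Module):
    # hypothetical 4-7-3 network, just to inspect its state_dict
    def __init__(self):
        super(Net, self).__init__()
        self.hid = T.nn.Linear(4, 7)
        self.oupt = T.nn.Linear(7, 3)

    def forward(self, x):
        return self.oupt(T.tanh(self.hid(x)))

net = Net()
for name, t in net.state_dict().items():
    print(name, tuple(t.shape))
# hid.weight (7, 4)
# hid.bias (7,)
# oupt.weight (3, 7)
# oupt.bias (3,)
```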

I won’t try to explain why the newer state_dict approach is preferred — it’s really low-level details.

Just for fun, I coded up three complete working PyTorch programs to demonstrate. The first program creates a dummy neural network, computes an example output, and saves the model using both the state_dict way and also the older “full” way. The second program loads the state_dict model and computes an example output. The third program loads the older-format model and computes an example output. All three output values are the same.

In addition to saving a PyTorch model using the two ways I’ve explained here, you can also save a PyTorch model using the ONNX format, which I don’t recommend at this time. I’ll explain ONNX in another blog post sometime. Briefly, ONNX is new and still immature (so ONNX is not fully supported), and you can’t even run a saved ONNX model using PyTorch (you have to use an entirely different system to run the saved model).

*Three (fashion) models saved (via photography). The photos were taken by Nina Leen (1910 – 1995) who was a famous photographer and was best known for her contributions to Life Magazine. Life Magazine was one of the most important means of communication in the world, especially from the years 1936 – 1972. These three old photos of models from the 1950s hold up very well today in my opinion.*


An interesting news article caught my attention recently. DNA kinship analysis was used to solve a crime that took place 36 years ago. On November 22, 1984, a 14-year old girl named Wendy Jerome walked out of the door of her home in Rochester, New York after dinner at 7:00 PM to deliver a birthday card to her best friend who lived a few doors down the street.

Wendy’s body was found a few hours later behind a dumpster. She had been raped and then brutally beaten to death.

DNA matching did not exist in 1984. The first use of DNA matching in a criminal case occurred in 1986. But Rochester police saved Wendy’s clothes. Years later, DNA matching had become a common technique, but the unknown murderer’s DNA on Wendy’s clothes did not match any criminal in the CODIS database.

*Left: Wendy Jerome. Right: The murderer, Timothy Williams from a police booking photo (with watermarks).*

However, a recently developed technique, DNA kinship analysis, solved the crime in September 2020. Technicians analyzed the unknown murderer’s DNA and generated a list of criminals whose DNA was in CODIS and who were highly likely to be related to the murderer. This list quickly identified a suspect, Timothy Williams. Williams’ DNA was obtained, and it matched the DNA found on Wendy Jerome 36 years before, proving he was responsible for the crime. Williams is age 56 now so he was 20 years old when he raped and murdered Wendy Jerome.

The first DNA matching techniques, which were developed in the 1980s, are based on classical statistics, leading to statements like, “There is only one chance in 100 trillion that the DNA came from someone other than the suspect.” However, deep neural machine learning techniques are now being applied to DNA analysis, including kinship analysis.

Fascinating. Kinship analysis is part of a larger field of study called bioinformatics. I wish I knew more about bioinformatics, especially new techniques that use deep neural technologies. But with the Internet, I’m quite sure I’ll learn as time goes by.

This story illustrates incredible science — the best of humanity — and an evil person that represents the worst. It’s a good thing to bring criminals to justice, but I hope that some day machine learning and AI can be used to prevent crime before it happens.


Neural networks are very good at classification, for example predicting the species (setosa, versicolor, or virginica) of an iris flower, based on the flower’s petal length and width, and sepal length and width. And neural networks are quite good at some regression problems, such as predicting the median house price in a town based on the average size of houses in the town, the tax rate in the town, the nearness to the closest major city, and so on.

But neural networks are not really intended for ordinary math computations such as computing the area of a triangle based on base and height. In case your elementary school math is a bit rusty, I’ll remind you that the area of a triangle is 1/2 times the base times the height.

I work at a large tech company and PyTorch is the officially preferred neural network code library, as well as my personally preferred library. I decided to look at predicting the area of a triangle using PyTorch version 1.6, the current version as of the weekend when I was walking my dogs.

I wrote a program that programmatically generated 10,000 training examples where the base and height values were random values between 0.1 and 0.9 (and so the areas were between 0.005 and 0.405). I created a 2-(100-100-100-100)-1 neural network — 2 input nodes, four hidden layers with 100 nodes each, and a single output node. I used tanh activation on the hidden nodes, and no activation on the output nodes.

I trained the network using batches of 10 items for 1,000 epochs.

After training, the network correctly predicted 100% of the training items to within 10% of the correct area, and 100% of the training items to within 5% of the correct area, and 82% of the training items to within 1% of the correct area. Whether this is a good result or not depends upon your point of view.

Good fun. There’s a lot of buzz around deep learning and there’s a beehive of research activity on the topic. But it’s not magic.

*On the same weekend I was thinking about triangles, I watched an old 1967 spy movie called “Deadlier than the Male” featuring female assassins with beehive hair styles. Left: Actress Elke Sommer played the primary assassin. I have no idea how that hair style works. Center and Right: An Internet image search returned quite a few images like these, so I guess the beehive style is still sometimes used today.*

```
# triangle_area_nn.py
# predict area of triangle using PyTorch NN

import numpy as np
import torch as T
device = T.device("cpu")

class TriangleDataset(T.utils.data.Dataset):
  # 0.40000, 0.80000, 0.16000
  # [0]      [1]      [2]

  def __init__(self, src_file, num_rows=None):
    all_data = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,3), delimiter=",", skiprows=0,
      dtype=np.float32)
    self.x_data = T.tensor(all_data[:,0:2],
      dtype=T.float32).to(device)
    self.y_data = T.tensor(all_data[:,2],
      dtype=T.float32).to(device)
    self.y_data = self.y_data.reshape(-1,1)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    if T.is_tensor(idx):
      idx = idx.tolist()
    base_ht = self.x_data[idx,:]  # idx rows, both predictor cols
    area = self.y_data[idx,:]     # idx rows, the 1 target col
    sample = { 'base_ht' : base_ht, 'area' : area }
    return sample

# ----------------------------------------------------------

def accuracy(model, ds):
  # ds is an iterable Dataset of Tensors
  n_correct10 = 0; n_wrong10 = 0
  n_correct05 = 0; n_wrong05 = 0
  n_correct01 = 0; n_wrong01 = 0

  # alt: create DataLoader and then enumerate it
  for i in range(len(ds)):
    inpts = ds[i]['base_ht']
    tri_area = ds[i]['area']  # float32 target area
    with T.no_grad():
      oupt = model(inpts)

    delta = abs(tri_area.item() - oupt.item())
    if delta < 0.10 * tri_area.item():
      n_correct10 += 1
    else:
      n_wrong10 += 1
    if delta < 0.05 * tri_area.item():
      n_correct05 += 1
    else:
      n_wrong05 += 1
    if delta < 0.01 * tri_area.item():
      n_correct01 += 1
    else:
      n_wrong01 += 1

  acc10 = (n_correct10 * 1.0) / (n_correct10 + n_wrong10)
  acc05 = (n_correct05 * 1.0) / (n_correct05 + n_wrong05)
  acc01 = (n_correct01 * 1.0) / (n_correct01 + n_wrong01)
  return (acc10, acc05, acc01)

# ----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(2, 100)  # 2-(100-100-100-100)-1
    self.hid2 = T.nn.Linear(100, 100)
    self.hid3 = T.nn.Linear(100, 100)
    self.hid4 = T.nn.Linear(100, 100)
    self.oupt = T.nn.Linear(100, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight)  # glorot
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.hid3.weight)
    T.nn.init.zeros_(self.hid3.bias)
    T.nn.init.xavier_uniform_(self.hid4.weight)
    T.nn.init.zeros_(self.hid4.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))  # or T.nn.Tanh()
    z = T.tanh(self.hid2(z))
    z = T.tanh(self.hid3(z))
    z = T.tanh(self.hid4(z))
    z = self.oupt(z)  # no activation
    return z

# ----------------------------------------------------------

def main():
  # 0. make training data file
  np.random.seed(1)
  T.manual_seed(1)
  hi = 0.9; lo = 0.1
  train_f = open("area_train.txt", "w")
  for i in range(10000):
    base = (hi - lo) * np.random.random() + lo
    height = (hi - lo) * np.random.random() + lo
    area = 0.5 * base * height
    s = "%0.5f, %0.5f, %0.5f \n" % (base, height, area)
    train_f.write(s)
  train_f.close()

  # 1. create Dataset and DataLoader objects
  print("Creating Triangle Area train DataLoader ")
  train_file = ".\\area_train.txt"
  train_ds = TriangleDataset(train_file)  # all rows
  bat_size = 10
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

  # 2. create neural network
  print("Creating 2-(100-100-100-100)-1 regression NN ")
  net = Net()

  # 3. train network
  print("\nPreparing training")
  net = net.train()  # set training mode
  lrn_rate = 0.01
  loss_func = T.nn.MSELoss()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
  max_epochs = 1000
  ep_log_interval = 100
  print("Loss function: " + str(loss_func))
  print("Optimizer: SGD")
  print("Learn rate: 0.01")
  print("Batch size: 10")
  print("Max epochs: " + str(max_epochs))

  print("\nStarting training")
  for epoch in range(0, max_epochs):
    epoch_loss = 0.0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch['base_ht']  # [10,2] base, height inputs
      Y = batch['area']     # [10,1] correct area to predict
      optimizer.zero_grad()
      oupt = net(X)  # [10,1] computed
      loss_obj = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_obj.item()  # accumulate
      loss_obj.backward()
      optimizer.step()
    if epoch % ep_log_interval == 0:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))
  print("Done ")

  # 4. evaluate model
  net = net.eval()
  (acc10, acc05, acc01) = accuracy(net, train_ds)
  print("\nAccuracy (.10) on train data = %0.2f%%" % (acc10 * 100))
  print("\nAccuracy (.05) on train data = %0.2f%%" % (acc05 * 100))
  print("\nAccuracy (.01) on train data = %0.2f%%" % (acc01 * 100))

if __name__ == "__main__":
  main()
```

```
Zoltar: jaguars by 6 dog = dolphins Vegas: jaguars by 3
Zoltar: steelers by 4 dog = texans Vegas: steelers by 3.5
Zoltar: patriots by 6 dog = raiders Vegas: patriots by 6.5
Zoltar: eagles by 10 dog = bengals Vegas: eagles by 6.5
Zoltar: browns by 6 dog = redskins Vegas: browns by 7
Zoltar: titans by 0 dog = vikings Vegas: titans by 2.5
Zoltar: bills by 4 dog = rams Vegas: bills by 3
Zoltar: fortyniners by 5 dog = giants Vegas: fortyniners by 4.5
Zoltar: bears by 0 dog = falcons Vegas: falcons by 3.5
Zoltar: colts by 5 dog = jets Vegas: colts by 10.5
Zoltar: chargers by 6 dog = panthers Vegas: chargers by 7
Zoltar: buccaneers by 0 dog = broncos Vegas: buccaneers by 6
Zoltar: cardinals by 6 dog = lions Vegas: cardinals by 6
Zoltar: seahawks by 6 dog = cowboys Vegas: seahawks by 5
Zoltar: packers by 0 dog = saints Vegas: saints by 3.5
Zoltar: ravens by 6 dog = chiefs Vegas: ravens by 3
```

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. For week #3 Zoltar has five hypothetical suggestions. All of them are highly questionable because during weeks 1-3 Zoltar doesn’t have much data yet and with limited data, Zoltar likes underdogs.

The five teams (four underdogs, one favorite) that Zoltar likes in week #3 are:

1. Zoltar likes the Vegas favorite Eagles over the Bengals.

2. Zoltar likes the Vegas underdog Bears against the Falcons.

3. Zoltar likes the Vegas underdog Jets against the Colts.

4. Zoltar likes the Vegas underdog Broncos against the Buccaneers.

5. Zoltar likes the Vegas underdog Packers against the Saints.

*Note: I’ve clearly got some bad data or a bug in Zoltar — there’s no way that Zoltar should favor the winless NY Giants over the excellent SF 49ers team. I’ll have to tear apart my data when I get a chance.*

*Another update: Argh! I messed up my data files completely. I’ll need to rerun predictions for weeks 1 – 3.*

*Update: I’ve rerun Zoltar’s predictions. The ones posted here were made using correct data.*

When you bet on an underdog your bet pays off if the underdog wins by any score, or if the game is a tie, or if the favorite team wins but by less than the Vegas point spread. You lose your bet only if the favorite team wins by more than the Vegas point spread. If the favorite team wins by exactly the point spread, the bet is a push.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game. This isn’t useful except for parlay betting.

Zoltar did poorly in week #2. Against the Vegas point spread, Zoltar was only 3-3. Darn.

Just predicting winners, Zoltar was an excellent 14-2. Just picking winners, the Vegas line also went 14-2 which is the best one-week result for Vegas I can ever remember.

*Left: My system is named after the Zoltar fortune teller machine you can find in arcades. Center and Right: Fortune teller machines have been around for decades. Here are two old ones I found on the Internet.*

I think we’re in the very early stage of deep learning. Perhaps the development of quantum computing will be the jet engine of deep learning.

Shown below are six of the U.S. land-based fighter planes that were already in production. The “P” stands for “pursuit” (fighter) and “XP” stands for “experimental pursuit”.

Top row. Left: Lockheed P-38 Lightning. Center: Bell P-39 Airacobra. Right: Curtiss P-40 Warhawk.

Bottom row. Left: Republic P-47 Thunderbolt. Center: North American P-51 Mustang. Right: Vought F4U Corsair (originally intended for aircraft carrier use, switched to land-based).

**1. Curtiss XP-46 (1941)** – Intended to be a successor to the existing P-40 plane, but its performance wasn’t better than the P-40D model.

**2. Grumman XP-50 (1941)** – Not ordered for production but the design evolved into the successful F7F Tigercat.

**3. Bell XP-52 (1941)** – Advanced design that would have featured contra-rotating pusher propellers and swept wings. Canceled because of other higher priority designs, including the P-59 Airacomet jet plane. (The XP-52 is the only plane listed that didn’t have at least one prototype built, but it looked too cool to leave out.)

**4. Vultee XP-54 (1943)** – Did not exceed the performance of existing production aircraft.

**5. Curtiss XP-55 (1943)** – Its performance did not meet expectations.

**6. Northrop XP-56 (1943)** – Proved to be an unstable design.

**7. Curtiss XP-60 (1941)** – Intended to be a successor to the existing P-40. Development not pursued because of other war-time production priorities.

**8. Curtiss XP-62 (1943)** – Had good performance but development was not pursued because of other, higher priority efforts.

**9. McDonnell XP-67 (1944)** – Very unusual design but only had performance equivalent to existing aircraft already in production.

**10. Republic XP-72 (1944)** – Had excellent performance but attention had turned to the first jet-powered aircraft.

**11. Fisher XP-75 (1943)** – Twin contra-rotating propellers. Performed well but not significantly better than the existing P-51, already in production.

**12. Bell XP-77 (1944)** – Explored the idea of a very small, very lightweight design. Ultimately, large, heavy designs proved to be much better.

**13. Vultee XP-81 (1945)** – Combined two small jets with a regular engine. Excellent performance but by the time it first flew, it was clear that fully jet-powered planes were the future.


For example, if there are 3 classes then a target might be (0, 1, 0) and a computed output might be (0.10, 0.70, 0.20), and the squared error would be (0 - 0.10)^2 + (1 - 0.70)^2 + (0 - 0.20)^2 = 0.01 + 0.09 + 0.04 = 0.14.
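The arithmetic above can be checked with a few lines of plain Python (the target and output values are the ones from the example):

```python
# squared error between a one-hot target and a softmax-style output
target = [0.0, 1.0, 0.0]
output = [0.10, 0.70, 0.20]

sq_err = sum((t - o) ** 2 for t, o in zip(target, output))
print(sq_err)  # approximately 0.14
```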

Now fast forward several years to the PyTorch library. Weirdly, I couldn't find any examples of multi-class classification using the traditional approach. Instead, all the examples used ordinal encoding for the training data, no activation on the output nodes, and CrossEntropyLoss() during training. It was quite digitally mysterious to me.
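A minimal sketch of that now-standard pattern, for comparison: ordinal class labels, raw logits on the output layer, and CrossEntropyLoss(), which applies log_softmax() internally. The 4-7-3 shape matches the Iris example; the data here is made-up dummy data.

```python
import torch as T

T.manual_seed(1)

# 4-7-3 network with NO softmax on the output layer --
# CrossEntropyLoss() expects raw logits
net = T.nn.Sequential(
  T.nn.Linear(4, 7),
  T.nn.Tanh(),
  T.nn.Linear(7, 3))

X = T.rand(10, 4)            # 10 dummy Iris-like items
Y = T.randint(0, 3, (10,))   # ordinal labels: 0, 1 or 2 (not one-hot)
loss_func = T.nn.CrossEntropyLoss()  # logits + ordinal labels

logits = net(X)
loss = loss_func(logits, Y)
loss.backward()
print(loss.item())
```

Notice that the one-hot encoding and the softmax() never appear explicitly; both are folded into CrossEntropyLoss().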

After many hours of experimentation I figured out what was going on, but explaining it all would take a ton of space. So I sat down one day to implement a PyTorch multi-class classifier using the old, traditional approach.

I used the Iris Dataset example. First I created training and test data where the species-to-predict was one-hot encoded. The data looks like:

```
5.1, 3.5, 1.4, 0.2, 1, 0, 0
5.6, 3.0, 4.5, 1.5, 0, 1, 0
6.5, 3.2, 5.1, 2.0, 0, 0, 1
. . .
```

Next I coded a 4-7-3 neural network that had softmax() activation on the output nodes. Then I coded training using the MSELoss() function.

Interestingly, even though everything worked, the results weren't quite as good as those from the now-standard ordinal-encoding, no-activation, CrossEntropyLoss() approach, in the sense that training took a bit longer to reach good results.

After I finished my experiment, I realized that there's an alternative approach. Instead of creating a file of training data where the labels-to-predict are one-hot encoded such as (0, 0, 1, 0), I could use a file where the labels are ordinal encoded such as 2, and then write a Dataset class that reads the ordinal encoded data and converts it to one-hot encoding. When I get some time, I'll try that approach out and post my comments.
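The idea could be sketched something like this. The class name IrisOrdinalDataset is hypothetical, and for a self-contained demo the data is an in-memory list rather than a file; the one-hot conversion happens inside __getitem__:

```python
import numpy as np
import torch as T

class IrisOrdinalDataset(T.utils.data.Dataset):
  # hypothetical Dataset: reads ordinal labels (0, 1 or 2 in
  # the last column) and converts them to one-hot on the fly
  def __init__(self, rows):
    # each row: 4 predictor values then one ordinal label
    self.data = np.array(rows, dtype=np.float32)

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    preds = T.tensor(self.data[idx, 0:4], dtype=T.float32)
    lbl = int(self.data[idx, 4])      # ordinal label
    spcs = T.zeros(3, dtype=T.float32)
    spcs[lbl] = 1.0                   # e.g. 2 becomes (0, 0, 1)
    return { 'predictors': preds, 'species': spcs }

rows = [[5.1, 3.5, 1.4, 0.2, 0],
        [5.6, 3.0, 4.5, 1.5, 1],
        [6.5, 3.2, 5.1, 2.0, 2]]
ds = IrisOrdinalDataset(rows)
print(ds[2]['species'])  # one-hot for class 2
```

With this design the rest of the training code, which expects one-hot targets, would not need to change at all.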

Well, that was a very satisfying experiment. I’m always pleased when I figure out something new. It’s very much like solving a puzzle.

*Three interesting mixed media images related to “digitally mysterious”, at least according to a Google image search. I’m not a big fan of ordinary photography as art, or ordinary digital art, but when digital and photography are combined, sometimes the results can be appealing.*

```python
# iris_nll_loss.py
# one-hot + softmax + MSELoss (traditional approach)
# PyTorch 1.6.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10

import numpy as np
import torch as T
device = T.device("cpu")  # apply to Tensor or Module

# -----------------------------------------------------------

class IrisDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None):
    # 5.0, 3.5, 1.3, 0.3, 1, 0, 0
    # . . .
    self.data = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0,7), delimiter=",", skiprows=0,
      dtype=np.float32)
    self.num_rows = num_rows  # not essential

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    if T.is_tensor(idx):
      idx = idx.tolist()
    preds = T.tensor(self.data[idx, 0:4],
      dtype=T.float32).to(device)
    spcs = T.tensor(self.data[idx, 4:7],
      dtype=T.float32).to(device)
    sample = { 'predictors' : preds, 'species' : spcs }
    return sample

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 7)  # 4-7-3
    self.oupt = T.nn.Linear(7, 3)
    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.nn.functional.softmax(self.oupt(z), dim=1)  # rows
    return z

# -----------------------------------------------------------

def accuracy(model, dataset):
  # assumes model.eval()
  dataldr = T.utils.data.DataLoader(dataset, batch_size=1,
    shuffle=False)
  n_correct = 0; n_wrong = 0
  for (_, batch) in enumerate(dataldr):
    X = batch['predictors']
    Y = T.flatten(batch['species'])  # one-hot target
    oupt = model(X)  # softmax pseudo-probabilities
    comp_idx = T.argmax(oupt)
    targ_idx = T.argmax(Y)
    if comp_idx == targ_idx:
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 100.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin Iris with MSELoss demo \n")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. create DataLoader objects
  print("Creating Iris train and test DataLoader ")
  train_file = ".\\Data\\iris_train_hot.txt"
  test_file = ".\\Data\\iris_test_hot.txt"
  train_ds = IrisDataset(train_file, num_rows=120)
  test_ds = IrisDataset(test_file)

  bat_size = 10
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)
  test_ldr = T.utils.data.DataLoader(test_ds,
    batch_size=1, shuffle=False)

  # 2. create network
  net = Net().to(device)

  # 3. train model
  max_epochs = 20
  ep_log_interval = 2
  lrn_rate = 0.12
  loss_func = T.nn.MSELoss(reduction='mean')  # assumes softmax
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_size)
  print("loss = " + str(loss_func))
  print("optimizer = SGD")
  print("max_epochs = %3d " % max_epochs)
  print("lrn_rate = %0.3f " % lrn_rate)

  print("\nStarting training")
  net.train()
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch
    num_lines_read = 0
    for (batch_idx, batch) in enumerate(train_ldr):
      # print(" batch = " + str(batch_idx))
      X = batch['predictors']  # [10,4]
      Y = batch['species']
      # num_lines_read += bat_size  # early exit
      optimizer.zero_grad()
      oupt = net(X)
      loss_obj = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_obj.item()  # accumulate
      loss_obj.backward()
      optimizer.step()
    if epoch % ep_log_interval == 0:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))
  print("Done ")

  # 4. evaluate model accuracy
  print("\nComputing model accuracy")
  net.eval()
  acc = accuracy(net, test_ds)  # item-by-item
  print("Accuracy on test data = %0.2f%%" % acc)

  # 5. make a prediction
  np.set_printoptions(precision=4)
  print("\nPredicting species for [6.1, 3.1, 5.1, 1.1]: ")
  unk = np.array([[6.1, 3.1, 5.1, 1.1]], dtype=np.float32)
  unk = T.tensor(unk, dtype=T.float32).to(device)
  probs = net(unk)
  print(probs)

  # 6. save model
  print("\nSaving trained model ")
  fn = ".\\Models\\iris_model.pth"
  T.save(net.state_dict(), fn)

  print("\nEnd Iris demo")

if __name__ == "__main__":
  main()
```
