The fundamental data structure in PyTorch is the tensor. A PyTorch tensor is a one-dimensional array (i.e., a vector) or a multidimensional array (i.e., a matrix) that can be handled by a GPU.

Working with PyTorch tensors can be mildly frustrating for beginners. Based on my learning path, I think it’s important to have a solid understanding of tensor basics, starting with different techniques for tensor creation.

Here’s the introductory example I use when I teach a PyTorch workshop.

The statement:

x = pt.tensor([[0,0,0],[0,0,0]], dtype=pt.float32)

creates a 2×3 tensor of zeros where each cell is a 32-bit floating point value. Notice the lower-case ‘t’. I like to use “pt” as an alias but almost all my colleagues spell out “torch”.

The statement:

x = pt.zeros(2, 3, dtype=pt.float32)

is a shortcut to do the same thing using the special zeros() function. I’m not a fan of most programming language shortcuts like this.

Here’s a third way to do the same thing:

x = pt.FloatTensor([[0,0,0],[0,0,0]])

which is a different shortcut. Other shortcuts include functions like LongTensor().

And now a fourth way is:

x = pt.Tensor([[0,0,0],[0,0,0]])

which is based on the idea that 32-bit float is the default numeric type for neural networks.

A fifth way, which is a bit different, is to create a PyTorch tensor from a NumPy array like:

a = np.array([[0,0,0],[0,0,0]], dtype=np.float32)

x = pt.from_numpy(a)

The from_numpy() function is especially useful when reading data from a text file using np.loadtxt().
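For instance, here's a minimal sketch of that workflow, assuming the "pt" alias used above. The file name in a real script would be something like "data.txt"; an in-memory string stands in for the file here so the sketch is self-contained.

```python
import io
import numpy as np
import torch as pt

# Stand-in for a text file read with np.loadtxt("data.txt", ...)
txt = io.StringIO("0,0,0\n0,0,0\n")
a = np.loadtxt(txt, delimiter=",", dtype=np.float32)

# Wrap the NumPy array as a PyTorch tensor (shares the underlying memory)
x = pt.from_numpy(a)

print(x.shape, x.dtype)  # torch.Size([2, 3]) torch.float32
```

Because from_numpy() shares memory with the source array, modifying the array also modifies the tensor; use pt.tensor(a) if you want an independent copy.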

The moral of the story is that when learning PyTorch, you have to move slower than you’d like because mastering tensor basics is a bit trickier than you might expect.


```
Zoltar: chiefs by 6 dog = chargers Vegas: chiefs by 3.5
Zoltar: broncos by 6 dog = browns Vegas: broncos by 3
Zoltar: texans by 3 dog = jets Vegas: texans by 6
Zoltar: falcons by 6 dog = cardinals Vegas: cardinals by 3
Zoltar: lions by 0 dog = bills Vegas: bills by 2.5
Zoltar: bears by 6 dog = packers Vegas: bears by 5.5
Zoltar: bengals by 6 dog = raiders Vegas: bengals by 3
Zoltar: cowboys by 1 dog = colts Vegas: colts by 3
Zoltar: jaguars by 4 dog = redskins Vegas: jaguars by 7
Zoltar: vikings by 6 dog = dolphins Vegas: vikings by 7
Zoltar: titans by 2 dog = giants Vegas: giants by 2.5
Zoltar: ravens by 7 dog = buccaneers Vegas: ravens by 8
Zoltar: seahawks by 5 dog = fortyniners Vegas: seahawks by 6.5
Zoltar: patriots by 0 dog = steelers Vegas: patriots by 3
Zoltar: rams by 7 dog = eagles Vegas: rams by 9.5
Zoltar: saints by 4 dog = panthers Vegas: saints by 6.5
```

Note: There’s some weirdness with the early Vegas point spreads for Arizona at Atlanta (no line), Dallas at Indianapolis (no line), and New England at Pittsburgh (no line). I’ll update this post when I figure out what’s going on.

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. For week #15, before the point spread updates, Zoltar has just one hypothetical suggestion.

1. Zoltar likes the Vegas underdog Titans against the Giants. Zoltar thinks the Titans are 2 points better than the Giants but Vegas has the Giants as 2.5 point favorites. So, a bet on the Titans will pay off if the Titans win (by any score) or if the Giants win but by less than 2.5 points (in other words, 2 points or 1 point).

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
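The 53% figure comes from straightforward breakeven arithmetic:

```python
# Breakeven win probability when risking $110 to win $100:
# p * 100 = (1 - p) * 110  =>  p = 110 / 210
stake, payout = 110.0, 100.0
breakeven = stake / (stake + payout)
print(round(breakeven, 4))  # 0.5238
```

So you need to be right a bit more than 52.4% of the time just to break even, which is where the "53% or better" rule of thumb comes from.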

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game (not by how many points). This isn’t useful except for parlay betting.

Zoltar sometimes predicts a 0-point margin of victory. There are two such games in week #15: Lions vs. Bills and Patriots vs. Steelers. In the first four weeks of the season, Zoltar picks the home team to win. After week #4, Zoltar uses historical data for the current season (which usually, but not always, ends up as a prediction that the home team will win).

==

Zoltar did rather poorly in week #14. Against the Vegas point spread, which is what Zoltar is designed to do, Zoltar went 1-1 . . . sort of. I botched one prediction, Redskins vs. Giants, because I was traveling to a conference and didn’t notice a key injury at the Redskins quarterback position. So I give myself a Mulligan on that one game; without the Mulligan, Zoltar was 1-2 against the spread.

For the season so far, against the Vegas spread, Zoltar is 42-25 which is about 62% accuracy.

Just predicting winners, Zoltar was a poor 8-8. Vegas was also 8-8. I believe this was the first week this season where Zoltar and Vegas completely agreed on just who’d win (even though both have had the same record in some weeks). For the season, just predicting which team will win, Zoltar is 141-65 (about 68% accuracy) and Vegas is 138-66 (also about 68% accuracy).

*My system is named after the Zoltar fortune teller machine you can find in arcades. There are many variations, but I like Zoltar the best.*

Two players each pick a spinner and spin it. The spinner that lands on the higher number wins.

Suppose the two spinners are A and B. There are 16 equally likely possible outcomes. Spinner A wins on 9 of the 16 outcomes: 2-1, 4-1, 4-3, 4-1, 4-3, 9-1, 9-3, 9-8, 9-8. Therefore spinner A is better than spinner B.

Suppose the two spinners are B and C. Spinner B wins on 10 of the 16 possibilities: 1-0, 3-0, 8-0, 8-5, 8-6, 8-7, 8-0, 8-5, 8-6, 8-7. Therefore spinner B is better than spinner C.

Now because A is better than B and B is better than C, A must be much, much better than C, right? Wrong! Spinner C wins against spinner A on 9 of 16 possibilities: 5-2, 5-4, 5-4, 6-2, 6-4, 6-4, 7-2, 7-4, 7-4. Amazing!

This is an example of non-transitive spinners. If the numbers were on four-sided dice instead of four-quadrant spinners you’d have the same situation.

How did I come up with this example? I wrote a short brute force program that randomly generated spinner data and checked if A > B and B > C and C > A. There were a couple of details but my little program quickly found an example that I edited slightly to make it a bit prettier.
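Reading the spinner values off the winning outcomes listed above gives A = (2, 4, 4, 9), B = (1, 3, 8, 8), C = (0, 5, 6, 7). A short script in the spirit of my brute force program verifies the non-transitivity:

```python
from itertools import product

def wins(p, q):
    """Number of the 16 equally likely outcomes where spinner p beats spinner q."""
    return sum(1 for a, b in product(p, q) if a > b)

# Values inferred from the winning outcomes listed above
A = (2, 4, 4, 9)
B = (1, 3, 8, 8)
C = (0, 5, 6, 7)

print(wins(A, B), wins(B, C), wins(C, A))  # 9 10 9 -- each spinner beats the next
```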

*There’s something very joyous about a girl spinning to celebrate life. But a guy spinning in a kilt — no, no, no.*

There are five main algorithms for setting the initial values of the weights. In the early days, for neural networks with a single hidden layer, the two most widely used algorithms were uniform and normal. These two algorithms didn’t work well with deep neural networks and so Glorot uniform and Glorot normal were devised. Neither of these worked well with very deep neural networks that use ReLU activation and so He initialization was devised.

In code, for an input-to-hidden layer where ni is the number of input nodes and nh is the number of hidden nodes, uniform and normal initialization look like:

```
lo = -0.01; hi = +0.01
for i in range(self.ni):
  for j in range(self.nh):
    self.ih_weights[i,j] = np.float32(self.rnd.uniform(lo, hi))
```

```
mu = 0.00; sd = 0.10
for i in range(self.ni):
  for j in range(self.nh):
    self.ih_weights[i,j] = np.float32(self.rnd.normal(mu, sd))
```

The main problem with uniform and normal initialization is that you have to pick values for lo and hi (uniform) or mean and stddev (normal).

Code for Glorot uniform and Glorot normal could look like:

```
fin = self.ni; fout = self.nh
sd = math.sqrt(6.0 / (fin + fout))
for i in range(self.ni):
  for j in range(self.nh):
    self.ih_weights[i,j] = np.float32(self.rnd.uniform(-sd, sd))
```

```
fin = self.ni; fout = self.nh
sd = math.sqrt(2.0 / (fin + fout))
for i in range(self.ni):
  for j in range(self.nh):
    self.ih_weights[i,j] = np.float32(self.rnd.normal(0.0, sd))
```

Here fin stands for “fan in” and fout stands for “fan out”. Glorot initializations are also called Xavier initializations because, even though the ideas were well known, they were popularized in a research paper written by a researcher named Xavier Glorot.

Code for He initialization (named after a paper by He, Zhang, Ren, and Sun, and so sometimes called He et al. initialization) could look like:

```
fin = self.ni
sd = math.sqrt(2.0 / fin)
for i in range(self.ni):
  for j in range(self.nh):
    self.ih_weights[i,j] = np.float32(self.rnd.normal(0.0, sd))
```

He initialization was designed strictly for use with layers that have ReLU activation, but the algorithm can be used on any layer.
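Pulled out of the class context above, the three schemes can be sketched as standalone NumPy functions (the function names are mine, not standard):

```python
import math
import numpy as np

rnd = np.random.RandomState(1)  # seeded for reproducibility

def glorot_uniform(fin, fout):
    # Uniform in [-limit, +limit] where limit = sqrt(6 / (fan_in + fan_out))
    limit = math.sqrt(6.0 / (fin + fout))
    return rnd.uniform(-limit, limit, size=(fin, fout)).astype(np.float32)

def glorot_normal(fin, fout):
    # Normal with mean 0 and sd = sqrt(2 / (fan_in + fan_out))
    sd = math.sqrt(2.0 / (fin + fout))
    return rnd.normal(0.0, sd, size=(fin, fout)).astype(np.float32)

def he_normal(fin, fout):
    # Normal with mean 0 and sd = sqrt(2 / fan_in) -- fan-in only
    sd = math.sqrt(2.0 / fin)
    return rnd.normal(0.0, sd, size=(fin, fout)).astype(np.float32)

w = glorot_uniform(4, 7)
print(w.shape)  # (4, 7)
```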

The moral of the story is that there are roughly 100 topics involved with a more-or-less complete understanding of neural networks. Weight initialization is one of these fundamental topics. When I was learning about neural networks, it seemed like there was always one more topic, but eventually I was able to connect all the dots.

*Current first lady Melania Trump wearing dots. Duchess of Cambridge Kate Middleton wearing dots. Both women are examples of elegance and class. And then former first lady Michelle Obama in dots. The term polka dots probably originated in the late 1930s during the “polka craze” in the U.S.*

An autoencoder is a special type of neural network and is probably best explained by an example. Some training data for a regular neural network might look like:

```
5.1, 3.5, 1.4, 0.2, setosa
7.0, 3.2, 4.7, 1.4, versicolor
6.3, 3.3, 6.0, 2.5, virginica
. . .
```

This is the famous iris data where the first four values on each line are predictor values and the last value is the species. You could set up a 4-7-3 neural network. After training you could use the trained neural network model to predict the species of a new, previously unseen flower by feeding predictor values such as:

```
unknown = np.array([[6.1, 3.1, 5.1, 1.1]], dtype=np.float32)
predicted = model.predict(unknown)
```

The training data for an autoencoder might look like:

```
5.1, 3.5, 1.4, 0.2, 5.1, 3.5, 1.4, 0.2
7.0, 3.2, 4.7, 1.4, 7.0, 3.2, 4.7, 1.4
6.3, 3.3, 6.0, 2.5, 6.3, 3.3, 6.0, 2.5
. . .
```

The first four values are predictors but now the goal is to predict the four inputs. If you set up a 4-2-4 neural network and train it to predict its own inputs, you indirectly create a compressed version of the data in the values of the two hidden nodes. Put another way, the dimensionality of the data has been reduced from 4 to 2.
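A minimal NumPy sketch of the forward pass of such a 4-2-4 network (untrained random weights, no biases; tanh is just one reasonable activation choice):

```python
import numpy as np

rnd = np.random.RandomState(0)
W1 = rnd.randn(4, 2).astype(np.float32) * 0.1  # 4 inputs  -> 2 hidden
W2 = rnd.randn(2, 4).astype(np.float32) * 0.1  # 2 hidden -> 4 outputs

def encode(x):
    return np.tanh(x @ W1)   # the 2-value compressed representation

def decode(h):
    return h @ W2            # reconstruction of the 4 inputs

x = np.array([5.1, 3.5, 1.4, 0.2], dtype=np.float32)
h = encode(x)
print(h.shape, decode(h).shape)  # (2,) (4,)
```

Training would adjust W1 and W2 so that decode(encode(x)) is close to x; after training, encode(x) is the 2-dimensional compressed version of each item.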

*Data with 16 dimensions has been reduced to two dimensions so it can be graphed as x-y data.*

*Diagram shows a 6-3-2-3-6 autoencoder architecture to reduce data with 6 dimensions (the inputs) down to 2 dimensions (the two middle nodes).*

In my article I demonstrate how to use an autoencoder to reduce/compress a data set that has 16 input values down to just 2 values. These two values can be used to create a graph of the data.

Using an autoencoder to reduce the dimensionality of data so that the data can be graphed in two dimensions is a common technique in machine learning. The mechanism used by autoencoders is also used inside more complex neural networks.


```
Zoltar: titans by 6 dog = jaguars Vegas: titans by 4.5
Zoltar: bills by 6 dog = jets Vegas: bills by 3.5
Zoltar: rams by 4 dog = bears Vegas: rams by 4
Zoltar: panthers by 4 dog = browns Vegas: panthers by 1
Zoltar: falcons by 0 dog = packers Vegas: packers by 6
Zoltar: texans by 6 dog = colts Vegas: texans by 4.5
Zoltar: chiefs by 6 dog = ravens Vegas: chiefs by 7.5
Zoltar: patriots by 5 dog = dolphins Vegas: patriots by 8
Zoltar: saints by 7 dog = buccaneers Vegas: saints by 8
Zoltar: redskins by 6 dog = giants Vegas: redskins by 1.5
Zoltar: chargers by 10 dog = bengals Vegas: chargers by 15
Zoltar: broncos by 3 dog = fortyniners Vegas: broncos by 6
Zoltar: lions by 0 dog = cardinals Vegas: lions by 2
Zoltar: cowboys by 2 dog = eagles Vegas: cowboys by 4
Zoltar: steelers by 9 dog = raiders Vegas: steelers by 11.5
Zoltar: vikings by 0 dog = seahawks Vegas: seahawks by 3
```

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. For week #14 Zoltar has three hypothetical suggestions.

1. Zoltar likes the Vegas underdog Falcons against the Packers. Zoltar thinks the two teams are exactly evenly matched, but Vegas has the Packers favored by 6.0 points. So, a bet on the Falcons will pay off if the Falcons win (by any score) or if the Packers win but by less than 6.0 points (in other words, 5 points or less).

2. Zoltar likes the Vegas favorite Redskins against the Giants. Zoltar thinks the Redskins are 6 points better than the Giants but Vegas has the Redskins favored only by 1.5 points. Therefore, Zoltar thinks the Redskins will win by 2 or more points and “cover the spread” as the phrase goes. Update: The Redskins quarterback is out and the Vegas point spread has moved to Giants favored by 3.0 points — Zoltar has no recommendation on this game.

3. Zoltar likes the Vegas underdog Bengals against the Chargers, thinking that the Chargers will win but will not cover the big 15.0 Vegas points.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

Just for fun, I track how well Zoltar does when trying to predict just which team will win a game (not by how many points). This isn’t useful except for parlay betting.

Zoltar sometimes predicts a 0-point margin of victory. There are three such games in week #14: Falcons vs. Packers, Lions vs. Cardinals, and Vikings vs. Seahawks.

==

Zoltar was pretty good in week #13. Against the Vegas point spread, which is what Zoltar is designed to do, Zoltar went 4-2 (which would win money as I explained above). For the season so far, against the Vegas spread Zoltar is 41-24 which is about 63% accuracy.

Just predicting winners, Zoltar was a mediocre 10-6. Vegas was also 10-6. For the season, just predicting which team will win, Zoltar is 133-57 (70% accuracy) and Vegas is 130-58 (about 69% accuracy).

*My system is named after the Zoltar fortune teller machine you can find in arcades (left). Coin-operated fortune telling machines have been around for a very long time.*

So, the idea is to get everything more or less working on your local machine, because you can be very efficient on a machine sitting at your feet. Then, after your small-scale program is acceptable, you provision some Cloud resources, copy your full training data and Python program up to the Cloud, and run your training program (or multiple versions of the program with different hyperparameter settings) in the Cloud, which will be very fast.

I walked through a tutorial at:

https://docs.microsoft.com/en-gb/azure/batch-ai/quickstart-tensorflow-training-cli

The tutorial shows how to prepare Azure Batch AI, copy files up into Azure, run the job, and fetch the results.

```
~$ az group create --name myResourceGroup --location eastus2
~$ az batchai workspace create . . .    # create workspace
~$ az batchai cluster create . . .      # create cluster
~$ az storage account create . . .      # create storage account
~$ az storage share create . . .        # create file share
~$ az storage directory create . . .    # directory for scripts
~$ az storage directory create . . .    # directory for logs
~$ az storage file upload . . .         # upload neural program
~$ az batchai experiment create . . .   # create experiment
~$ (create job.json config file)
~$ az batchai job create . . .          # create a job
~$ az batchai job file stream . . .     # monitor job progress
~$ az storage file list . . .           # show output files
~$ az storage file download . . .       # fetch an output file
~$ az batchai cluster resize . . .      # clean/save cluster
~$ az batchai cluster delete . . .      # delete cluster
~$ az group delete . . .                # delete group
```

You launch an Azure Cloud Shell, which can be based on either Windows PowerShell or Unix Bash shell and then issue commands, like the ones above. You use some commands that start with “az” which are used to configure Azure resources, and some commands that start with “az batchai” which are used to create and run the batch jobs.

My initial impression is that the process has a *lot* of steps. The tutorial has approximately 17 steps, quite a bit can go wrong at many of them, and the overall process is quite slow. Also, Batch AI performs everything on Ubuntu Linux. For developers who are used to Windows, this creates a lot of friction for even common simple tasks such as file editing (using vi or nano instead of Notepad), navigating through the file system, and so on.

An alternative to Azure Batch AI is to use a Virtual Machine in the Cloud. I’m a big fan of VMs and prefer to use them when possible, but I can imagine scenarios where the Batch AI approach would be very useful.

Google has a similar “Google Cloud Shell” too, but I haven’t used either Google Cloud Shell or Azure Cloud Shell enough to form an informed opinion of how they compare. My hunch is that both cloud shells are probably quite similar — in the end batch AI just means copying files to the Cloud and running a Python script, so it’s not rocket science.


The game is mostly like regular blackjack. Each player gets two cards and the dealer gets one card face down and one card face up. But before the first player decides to hit or stand, the dealer peeks at the face down card to see what it is, then places the face-down and face-up cards on one of three positions that tell the players whether the face-down card is low (2, 3, 4, 5), medium (6, 7, 8, 9), or high (10, J, Q, K, A).

Interesting! For example, suppose you have a total of 18 and the dealer’s up-card is a 9 and the dealer’s down-card is high. Therefore, you absolutely know that the dealer has 19 or 20 and so you’d take a hit. Even though you’ll still likely lose, at least you have a small chance to draw an A, 2 or 3.

The game leads to some surprisingly tricky decisions. Suppose you have a total of 11 which is normally an automatic double-down situation. But if the dealer has a 10 showing and the down-card is “high”, you shouldn’t double-down.

I played the game for about 20 minutes. I would have had a great time, but there were two other guys at the table. They were idiots. They played basic strategy and didn’t adjust their hit-stand decision on the huge amount of information the low, medium, high knowledge gives you. I was gritting my teeth on every hand, just waiting to see what dumb decision they’d make next.

*The well-known wizard-of-odds Web site analyzes many gambling games. Here’s the site’s strategy analysis for Down Under Blackjack.*

I enjoyed playing Down Under Blackjack (even with the two idiots at the table). But I noticed an unintended, negative side effect in the game. Because players have more information, they hit more often and lose by busting more often than in regular blackjack. And there’s a different psychology between losing when you stand (don’t draw a card) and the dealer’s hand just beats you, and losing when you hit and go over 21 and bust.

If the dealer’s hand beats you, well, in your mind, that’s just bad luck. But if you draw a card and bust, psychologically, it’s you who made the wrong decision. In other words, if you lose by busting it hurts more than losing by standing and the dealer beating your hand. Psychologists call this the errors of omission versus errors of commission effect. Because of the way Down Under Blackjack works, players lose by busting more often compared to regular blackjack. So, even though Down Under Blackjack is really a lot of fun, I don’t think it will last because of the subtle psychology.

Math is fascinating. Psychology is fascinating. Put them together in Las Vegas and you get an interesting city.

*When I worked at a Marriott Hotel in Newport Beach while I was in college at U.C. Irvine, I learned that clubs and restaurants involve a lot of math and psychology. These waitresses work at the Hakkasan club in the MGM Hotel in Las Vegas, which is the site of many tech conferences including the upcoming Microsoft Azure & AI Conference. See https://azure.microsoft.com/en-us/community/events/azure-ai-conference/.*

Carlsen, age 27, from Norway, became champion in 2013 and this was his third successful defense of his title. Caruana, age 26, from the United States, came close to winning the match. All 12 games of the match were drawn — quite remarkable. Carlsen had an advantage in a few of the games but was unable to defeat Caruana. So, the match went to a set of four tie-break games where each player gets just 25 minutes, instead of the usual 100 minutes for the first 40 moves.

*Champion Magnus Carlsen on the left and challenger Fabiano Caruana on the right at the start of the tie-break games. Incredible tension.*

Carlsen played the white pieces in the first tie-break game. The tension must have been enormous for the players, knowing that one move could decide the game and likely the fate of the entire match. The champion broke through by winning a pawn and, after one minor mistake by Caruana, slowly converted his advantage to win the game.

In the second tie-break game, Carlsen, playing black, used the sharp Sveshnikov variation of the Sicilian Defense. In an extremely complex position, Caruana made a critical mistake on move 26 and Carlsen instantly pounced. Caruana had to resign on move 28: his queen was attacked, but as soon as it moved, Carlsen’s knight would deliver a devastating check on the d6 square.

*Carlsen, playing black, has just moved his king from g8 to h7 to avoid a check from white’s knight. Caruana, playing white, had to resign here because as soon as he moved his attacked queen, Carlsen would play Nd6 with check and win the rook on the c1 square.*

So, after two of the four tie-break games, Carlsen led 2-0 and Caruana had to win the third game or the match would be over. Carlsen played solidly. Caruana was unable to break through and had to start taking huge risks. But Carlsen defended smoothly and Caruana had to resign on move 51 and so Carlsen retained his crown.

What a great match. The big question now in my mind is how well can Caruana recover from what must be an incredibly disappointing loss. Will Caruana be the challenger in 2020 or will one of the handful (perhaps 10) of other super strong grandmasters compete for the title?

*Four of the 16 chess world champions. Wilhelm Steinitz (Austria-Hungary, 1st champion, 1886-1894), Jose Raul Capablanca (Cuba, 3rd, 1921-27). Max Euwe (Holland, 5th, 1935-37). Mikhail Tal (Latvia / Soviet Union, 8th, 1960-61). Fabiano Caruana didn’t quite become the 17th.*

The ideas of primals and duals, and ascent and descent, are closely related. Suppose you have some machine learning model with two weights, w1 and w2. The most common way to train the model is to find the values of w1 and w2 so that error against a set of training data is minimized. Suppose that you somehow know that the error function is E(w1, w2) = w1^2 + w2^2. In a real problem you don’t know the full error function so you have to approximate it probabilistically (“stochastically”).

Because you want to minimize error, you use one of many techniques to gradually reduce the error. In many training algorithms you use the Calculus derivative of the error function and these techniques have the term “gradient” in their description. Such a technique is called “descent” because error goes down.

The explanation I’ve given so far describes “stochastic gradient descent” for neural networks.
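For the toy error function E(w1, w2) = w1^2 + w2^2, gradient descent is just a few lines (the learning rate and starting point are arbitrary choices for illustration):

```python
# Gradient descent on E(w1, w2) = w1^2 + w2^2.
# The gradient is (2*w1, 2*w2); each step moves against it, so error goes down.
w1, w2 = 3.0, -4.0
lr = 0.1  # learning rate
for _ in range(100):
    g1, g2 = 2.0 * w1, 2.0 * w2
    w1 -= lr * g1
    w2 -= lr * g2
print(w1, w2)  # both values end up very close to 0.0, the minimum
```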

But sometimes working directly with the error function is very difficult. This introduces the idea of a “dual”. The dual of an error function is a different function that’s closely related mathematically. The original function is now called the primal to distinguish it. For example, suppose you have D(w1, w2) = 100 - w1^2 - w2^2 and you somehow know that maximizing the value of the dual also minimizes the value of the primal.

*Click to enlarge. This is the image I use when explaining stochastic gradient descent to a class or workshop.*

So, to solve the original problem of finding the values of w1 and w2 that minimize error (the primal), you can find the values of w1 and w2 that maximize the dual. Because you use a technique that gradually increases the value of the dual, such a technique is called “ascent”.

Most minimization (for descent) or maximization (for ascent) techniques consider all weights and their associated variables more or less together. But in some specialized algorithms, the minimization or maximization technique looks at just one weight and its variable at a time, and such techniques are sometimes called “coordinate” techniques. So, SDCA doesn’t look at the model error function directly; it looks at a **dual** function and maximizes it (**ascent**). And the technique doesn’t look at all weights at once; it examines one weight at a time (holding the others constant), i.e., **coordinate**. And because the technique samples one or a few training items at a time, it’s **stochastic**. Hence, SDCA.
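For the toy dual D(w1, w2) = 100 - w1^2 - w2^2, a gradient-based coordinate ascent sketch updates one weight at a time, holding the other fixed (again, the learning rate and starting point are arbitrary illustration choices):

```python
# Coordinate ascent on the dual D(w1, w2) = 100 - w1^2 - w2^2.
# Each pass updates one coordinate at a time while the other is held fixed,
# moving *with* the partial derivative so the dual goes up.
w1, w2 = 3.0, -4.0
lr = 0.1  # learning rate
for _ in range(100):
    w1 += lr * (-2.0 * w1)   # dD/dw1, with w2 held constant
    w2 += lr * (-2.0 * w2)   # dD/dw2, with w1 held at its new value
D = 100.0 - w1 * w1 - w2 * w2
print(D)  # approaches 100, the maximum of the dual
```

Note that maximizing this dual drives w1 and w2 to the same values (0, 0) that minimize the primal E(w1, w2) = w1^2 + w2^2, which is the whole point of the primal-dual relationship.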

Before I sign off, the term ascent in ML is also used in scenarios where, instead of maximizing a dual function, you maximize a likelihood function. This technique is sometimes called “maximum likelihood estimation” (MLE).

*Three illustrations by British artist John Harris (born 1948). The one on the left is titled “Ascent”.*