Yet Even More About PyTorch Neural Network Weight Initialization

I’ve been working through the details of the PyTorch neural network library. I’m still examining basic concepts like weight and bias initialization. Even a task as simple as setting weights to some fixed value is surprisingly tricky.

Here’s example code that sets up a 4-7-3 NN (for the Iris Dataset problem):

# PyTorch 0.4.1 Anaconda3 5.2.0 (Python 3.6.5)
import torch as T

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.fc1 = T.nn.Linear(4, 7)  # 'fully connected'
    T.nn.init.xavier_uniform_(self.fc1.weight)
    T.nn.init.zeros_(self.fc1.bias)

    self.fc2 = T.nn.Linear(7, 3)
    T.nn.init.xavier_uniform_(self.fc2.weight)
    T.nn.init.uniform_(self.fc2.bias, -0.05, 0.05)

  def forward(self, x):
    x = T.tanh(self.fc1(x))
    x = self.fc2(x) 
    return x

There’s a lot going on in that code. I was experimenting by setting hard-coded weight values, for example:

for j in range(7):
  for i in range(4):
    self.fc1.weight[j][i] = 0.5555  # errors

But this code throws an error because fc1.weight is a Parameter, a tensor that requires gradients, and autograd doesn’t allow a plain in-place element assignment on it. However, this code works:

for j in range(7):
  for i in range(4):
    self.fc1.weight.data[j][i] = 0.5555 # OK

The fc1.weight.data attribute is the underlying Tensor object, which isn’t tracked by autograd. A very helpful expert (“ptrblck”) on the PyTorch discussion forum recommended:

with T.no_grad():
  for j in range(7):
    for i in range(4):
      self.fc1.weight[j][i] = 0.5555  # OK
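
As a footnote, if the goal is simply to fill an entire weight matrix or bias vector with one fixed value, PyTorch’s built-in initializers can do it directly. Here’s a minimal sketch using T.nn.init.constant_ (the 0.5555 is just the same arbitrary value from the loops above):

# alternative: fill all the weights with one value, no explicit loops needed
T.nn.init.constant_(self.fc1.weight, 0.5555)
T.nn.init.constant_(self.fc1.bias, 0.0)   # works for biases too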

Wow. I’ve used many code libraries before and trust me, this is tricky stuff. The moral of the story is that in order to be successful at writing code, you have to relentlessly pay attention to very tiny details. Thinking in terms of the big picture just doesn’t work when you’re implementing code.



I used to enjoy reading the daily “Herman” cartoons in newspapers by Canadian cartoonist Jim Unger (1937 – 2012). The series ran from 1975 to 1992.

Posted in Machine Learning, PyTorch | Leave a comment

More Fantasy Football and Machine Learning

I have this vague notion that there must be interesting connections between fantasy football and machine learning. I know a lot about machine learning but not a whole lot about fantasy football. So, several days ago I set up a fantasy football league of six teams. I used ESPN, one of the three largest services (ESPN, NFL, Yahoo).

I wanted to explore further, so I set up a second dummy league, this time using the NFL.com system. My goal was to see how standardized, or not, fantasy football is across services. The bottom line is that, as far as I can tell, the services are very similar in terms of rules. Therefore, the choice of service depends mostly on the Web management interface. My conclusion is that all three services’ Web interfaces are about the same. None of the three is what I’d rate as excellent, but they’re all pretty good.

The screenshots below (click to enlarge) show most of the major steps involved in setting up a fantasy football league. I learned from previous experience that it’s a good idea to sign up for a (required) NFL.com registration BEFORE trying to do anything else because trying to register in the middle of the league sign-up process basically doesn’t work.







While I was doing all this, in the back of my mind was a realization of how I learn. Some people learn best by first learning general principles and then learning specific applications of the principles. Other people learn best in the opposite way: learning a few specific examples, and then using that knowledge to infer the general principles involved.

I’ve been aware of this learning difference ever since my days as a university professor. The difference was very pronounced, and so I tried to teach every class using both techniques (general to specific, specific to general).

I am definitely a specific-to-general learner, but based on my experience, I suspect most people are general-to-specific learners. Of course, it’s not a binary thing; both approaches are needed.



“Café de Flore” by Barbara Flowers, and a photograph of it. General and specific.

Posted in Machine Learning, Miscellaneous | 1 Comment

Calculating Gini Impurity Example

The Gini Impurity (GI) metric measures the homogeneity of a set of items. GI can be used as part of a decision tree machine learning classifier. The lowest possible value of GI is 0.0. The maximum value of GI depends on the number of classes involved; for k evenly distributed classes it is 1 - 1/k, which approaches 1.0 as k increases.

Suppose you have 12 items — apples, bananas, cherries. If there are 0 apples, 0 bananas, 12 cherries, then you have minimal impurity (this is good for decision trees) and GI = 0.0. But if you have 4 apples, 4 bananas, 4 cherries, you have maximum impurity and it turns out that GI = 0.667.

Instead of showing the math equation (you can find it on Wikipedia), I’ll show example calculations. Maximum GI:

         apples  bananas  cherries
count =  4       4        4
p     =  4/12    4/12     4/12
      =  1/3     1/3      1/3

GI = 1 - [ (1/3)^2 + (1/3)^2 + (1/3)^2 ]
   = 1 - [ 1/9 + 1/9 + 1/9 ]
   = 1 - 1/3
   = 2/3
   = 0.667

When the items are evenly distributed across the classes, as in the example above, you have maximum GI, but the exact value depends on how many classes there are. A bit less than maximum GI:

         apples  bananas  cherries
count =  3       3        6
p     =  3/12    3/12     6/12
      =  1/4     1/4      1/2

GI = 1 - [ (1/4)^2 + (1/4)^2 + (1/2)^2 ]
   = 1 - [ 1/16 + 1/16 + 1/4 ]
   = 1 - 6/16
   = 10/16
   = 0.625

In the example above, the items are not quite evenly distributed, and the GI is slightly less (which is better when used for decision trees). Minimum GI:

         apples  bananas  cherries
count =  0       12        0
p     =  0/12    12/12     0/12
      =  0       1         0

GI = 1 - [ 0^2 + 1^2 + 0^2 ]
   = 1 - [ 0 + 1 + 0 ]
   = 1 - 1
   = 0.00

In the example above, the items are as unevenly distributed as possible, and the GI is the smallest possible value of 0.0 (which is the best possible situation when used for decision trees).
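
Here’s a short Python sketch of the calculation, just to make it concrete (the function name gini_impurity is mine, not from any particular library):

def gini_impurity(counts):
  # counts holds the number of items in each class, e.g. [4, 4, 4]
  total = sum(counts)
  return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini_impurity([4, 4, 4]))    # 0.6667 (maximum impurity for 3 classes)
print(gini_impurity([3, 3, 6]))    # 0.6250
print(gini_impurity([0, 12, 0]))   # 0.0000 (minimum impurity)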

“Gini” is not an acronym; the metric is named after the statistician Corrado Gini. Gini impurity (sometimes called the Gini index in the decision tree literature) is not the same as the Gini coefficient, a different metric used in economics to measure income inequality. The Gini impurity metric can be used when creating a decision tree, but there are alternatives, including entropy-based information gain. The advantage of GI is its simplicity.



“Purity” by Italian artist Pino Daeni and “Purity” by Chinese artist Jia Liu. I can create sophisticated software systems but I could never create art like these paintings.

Posted in Machine Learning | Leave a comment

The P'' Programming Language

The P'' programming language (the letter P followed by two single-quote characters, pronounced “P prime-prime”) is not really a practical programming language; it’s mostly a theoretical notion. P'' was introduced in a 1964 research paper by Corrado Böhm. The base P'' has six instructions:

1. R  - move the memory (data) pointer right
2. L  - move the memory (data) pointer left
3. r  - increment the value at [ptr]
4. r' - decrement the value at [ptr] (a two-character token)
5. (  - begin loop; the loop repeats until [ptr] = 0
6. )  - end loop

P'' doesn’t have any input/output statements, but if you add i (for input) and o (for output), then here is a P'' program that adds 3 plus 5:

rrrRrrrr(LrRr')rrrrrrrr(LrrrrrrRr')o

If this is the first time you’ve seen P'', it’s probably a bit confusing, but if you read the Wikipedia article at https://en.wikipedia.org/wiki/P%E2%80%B2%E2%80%B2 you’ll quickly understand it.
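
To make the semantics concrete, here is a minimal Python interpreter sketch for the six base instructions plus the o output extension mentioned above (i for input is omitted; the function name and the fixed-size, right-growing tape are my own simplifying assumptions):

def run_pprime(program, tape_size=100):
  # interprets the six instructions listed above, plus 'o' (print current cell)
  tape = [0] * tape_size
  ptr = 0   # memory (tape) pointer
  ip = 0    # index into the program string
  while ip < len(program):
    ch = program[ip]
    if ch == 'R':
      ptr += 1
    elif ch == 'L':
      ptr -= 1
    elif ch == 'r':
      if ip + 1 < len(program) and program[ip + 1] == "'":
        tape[ptr] -= 1   # r' is the two-character decrement token
        ip += 1          # consume the quote character
      else:
        tape[ptr] += 1   # plain r increments
    elif ch == '(' and tape[ptr] == 0:
      depth = 1          # skip forward to the matching ')'
      while depth > 0:
        ip += 1
        depth += (program[ip] == '(') - (program[ip] == ')')
    elif ch == ')' and tape[ptr] != 0:
      depth = 1          # jump back to the matching '('
      while depth > 0:
        ip -= 1
        depth += (program[ip] == ')') - (program[ip] == '(')
    elif ch == 'o':
      print(tape[ptr])
    ip += 1
  return tape

run_pprime("rrro")  # increments cell 0 three times, then prints 3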

There are several not-very-clever variations of P'' where hobbyists substitute different symbols and create what they think are cute new languages, but underneath they are all just P''. In general, strange, impractical languages like P'' are called esoteric languages.

The topic of this blog post was motivated in part by a few news items I read recently. One was about attempts to add fill-in-the-blank categories of people to tech companies by creating dumbed-down programs like “Day of Coding” for young students. In the end, a person is only successful at something they’re truly passionate about.

If I saw a high school student who was fascinated by P'', I’d be almost certain they’d be well suited to study computer science. But if I saw a student who thought a drag-and-drop thing that moves an Angry Bird cartoon icon around a maze was cool, I wouldn’t be sure if they’d be suited for serious computer science or not.

Computer Science is difficult. Really difficult. So are things like electrical engineering, biochemistry, and so on. To succeed in these areas, a person has to have relentless drive, and that drive has to be internally motivated, not pushed by an external agenda. Put another way, if I were the czar of increasing the pipeline of students into science and tech, I’d focus the vast majority of my efforts on K-6 education. In my opinion (backed by quite a bit of research), by age 13 a student’s passions for particular topics are, for the most part, either there or not.



This guy is about to hear some non-esoteric language from his girlfriend. His instruction pointer will likely not get incremented for a few days.

Posted in Miscellaneous

NFL 2018 Week 1 Predictions – Zoltar Likes Five Vegas Favorites

Zoltar is my NFL prediction computer program. Zoltar uses Reinforcement Learning and a Deep Neural Network. Here are Zoltar’s predictions for week 1 of the 2018 NFL season:

Zoltar:      eagles  by    6  dog =     falcons    Vegas:      eagles  by    4
Zoltar:    steelers  by   12  dog =      browns    Vegas:    steelers  by    6
Zoltar:     bengals  by    0  dog =       colts    Vegas:       colts  by    3
Zoltar:      titans  by    0  dog =    dolphins    Vegas:      titans  by    2
Zoltar:     vikings  by   10  dog = fortyniners    Vegas:     vikings  by  5.5
Zoltar:      saints  by    9  dog =  buccaneers    Vegas:      saints  by  9.5
Zoltar:    patriots  by   11  dog =      texans    Vegas:    patriots  by  6.5
Zoltar:     jaguars  by    4  dog =      giants    Vegas:     jaguars  by  3.5
Zoltar:      ravens  by    3  dog =       bills    Vegas:      ravens  by  5.5
Zoltar:    chargers  by    1  dog =      chiefs    Vegas:    chargers  by    3
Zoltar:    panthers  by    6  dog =     cowboys    Vegas:    panthers  by  2.5
Zoltar:   cardinals  by    5  dog =    redskins    Vegas:    redskins  by    0
Zoltar:    seahawks  by    0  dog =     broncos    Vegas:     broncos  by  2.5
Zoltar:     packers  by    6  dog =       bears    Vegas:     packers  by  8.5
Zoltar:       lions  by    6  dog =        jets    Vegas:       lions  by    7
Zoltar:        rams  by    1  dog =     raiders    Vegas:        rams  by    3

Zoltar theoretically suggests betting when the Vegas line differs from Zoltar’s prediction by more than 3.0 points. For week #1, Zoltar has five suggestions. However, week #1 involves a lot of guesswork, for both Zoltar and Las Vegas.

1. Zoltar likes the Vegas favorite Steelers against the Browns. Zoltar thinks the Steelers are 12 points better than the Browns but Vegas says the Steelers will win only by 6.0 points. A bet on the Steelers will pay only if the Steelers win by more than 6.0 points (in other words, 7 or more points).

2. Zoltar likes the Vegas favorite Vikings against the 49ers. Zoltar believes the Vikings will win by 10 points but Vegas thinks the Vikings will win by only 5.5 points, therefore, Zoltar thinks the Vikings will “cover the spread”.

3. Zoltar likes the Vegas favorite Patriots over the Texans. Zoltar thinks the Patriots are 11 points better than the Texans, but Vegas says the Patriots are only 6.5 points better than the Texans.

4. Zoltar likes the Vegas favorite Panthers against the Cowboys. Zoltar thinks the Panthers are 6 points better than the Cowboys but Vegas has the Panthers as only 2.5 points better than the Cowboys.

5. Zoltar likes the Arizona Cardinals against the Washington Redskins. Vegas has this as a “pick ’em game”. Zoltar believes the Cardinals are 5 points better than the Redskins.
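
As a minimal sketch of the more-than-3.0-points rule (a hypothetical helper function, not the actual Zoltar code; the margins are from the table above):

def suggest_bet(zoltar_margin, vegas_margin, threshold=3.0):
  # both margins are relative to the same team; positive means that team is favored
  return abs(zoltar_margin - vegas_margin) > threshold

print(suggest_bet(12.0, 6.0))  # Steelers game: True, the difference is 6.0
print(suggest_bet(6.0, 4.0))   # Eagles game: False, the difference is only 2.0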

Update 09/07/2018 – The Vegas point spread for the Eagles vs. Falcons game jumped from Eagles being favored by 4 to the Falcons being favored by 3 when it was announced that the Eagles quarterback, Carson Wentz, was out for the game due to injury. Zoltar now strongly recommends a bet on the Eagles. Zoltar believes that the impact of any single player, even a quarterback, is at most 3 points.

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
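
As a quick sanity check on that 53% figure, here’s the break-even arithmetic at standard -110 odds (plain Python, not part of the Zoltar code):

risk, payout = 110, 100              # risk $110 to win $100
break_even = risk / (risk + payout)  # 110 / 210
print(break_even)                    # 0.5238..., so roughly 52.4% is
                                     # break-even; 53% gives a small edge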



Here’s what a betting sheet in Las Vegas looks like (top picture). I picked this one up from the MGM Grand Sports Book (bottom picture) while at a recent conference.


Just for fun, I track how well Zoltar does when simply trying to predict which team will win a game. This isn’t useful except for parlay betting.

Zoltar sometimes predicts a 0-point margin of victory. There are three such games in week #1 –

Bengals vs. Colts
Titans vs. Dolphins
Seahawks vs. Broncos

In those situations, so I can track the raw number of correct predictions, Zoltar picks the home team to win during the first four weeks of the season. Therefore, Zoltar picks the Colts, Dolphins, and Broncos to win.

After week 4, Zoltar uses historical data for the current season (which usually, but not always, ends up in a prediction that the home team will win).



The Zoltar system is named after the arcade fortune teller machine, which is named after the wish-granting machine in the 1988 movie “Big” starring Tom Hanks. I saw this Zoltar outside a magic shop inside the New York New York Hotel in Las Vegas.

Posted in Machine Learning, Zoltar

A Neural Network using Just Python and NumPy

For the past dozen months or so, I’ve been working with neural network libraries including TensorFlow, Keras, CNTK, and PyTorch. As a mental exercise, I decided to implement a neural network from scratch, using just raw Python and NumPy.

I’ve done this before. In fact, until about three years ago, implementing a neural network from scratch was just about the only option available, because the major neural network libraries didn’t exist yet.

The exercise was a lot of fun. My implementation used online training — processing one line of training data at a time. Using mini-batch training is quite a bit more complicated so I’ll save that for another day. Additionally, I used mean squared error rather than the more common cross entropy error, mostly because back-propagation with MSE is a bit easier to understand than back-propagation with cross entropy.

As I was coding, I was mildly surprised to realize how much I’d learned over the past several months, while using Keras and CNTK. For example, when I initialized weights and biases, I used a Uniform random distribution, but I could have used Glorot Uniform or Glorot Normal, because I now understand those algorithms well.
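
For reference, a Glorot (Xavier) uniform initialization can be written in a few lines of NumPy. This is a rough sketch of the idea, not the exact code from my implementation:

import numpy as np

def glorot_uniform(fan_in, fan_out, rnd):
  # Glorot/Xavier uniform: sample from U(-limit, +limit)
  # where limit = sqrt(6 / (fan_in + fan_out))
  limit = np.sqrt(6.0 / (fan_in + fan_out))
  return rnd.uniform(-limit, limit, size=(fan_in, fan_out))

rnd = np.random.RandomState(1)
ih_wts = glorot_uniform(4, 7, rnd)  # input-to-hidden weights for a 4-7-3 net
ho_wts = glorot_uniform(7, 3, rnd)  # hidden-to-output weights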

The moral of the story is that the field of machine learning is moving very fast and it’s important to stay as up to date as possible. And that requires daily effort.



The F-104 jet fighter set a world speed record of 1,404 miles per hour (2,260 km/h) in 1958. That’s fast, even by the standards of modern jet fighters sixty years later.

Posted in Machine Learning

Recap of the 2018 VM World Conference

I gave a short talk about deep neural networks at the 2018 VM World conference. The event ran from August 26-30 in Las Vegas.

VM World is put on by the VMware company. VMware makes virtualization software that allows you to create several virtual machines on one physical machine, which basically saves money. In some sense, then, VMware is a one-trick pony, in a way similar to how Google is essentially just an Internet search engine.


The event was at the Mandalay Bay conference center where I’ve spoken many times.


Shorts and backpacks = the IT guys. Trousers and sports jackets = the sales guys.

I estimate there were about 18,000 attendees at VM World. Attendance seemed a bit down compared to the last two times I attended (a few years ago). Most of the attendees I talked to either worked in an IT department of a large company, or were in sales of some kind.

My talk was “Understanding Deep Neural Networks”. It was part of an out-of-band set of five short presentations intended to augment the main talks. The session was organized by a colleague I used to work with at a different company, who now works at VMware. I described what deep neural nets are and explained how they’re well-suited to run in a virtual environment.


The keynote talk on the first day was quite good.

I sat in on a few of the regular sessions. My impression is that VMware is way behind other tech companies when it comes to adopting machine learning and AI, at least compared to the companies that I deal with. There was a sense of an old-school mentality — protecting existing systems rather than looking to the future. I hope I’m wrong. I have several good friends at VMware. VMware is exactly 20 years old, and historically that age has been the beginning of the end for many tech companies.


The Expo was big and interesting, but nothing particularly special.

As a side note, at some events I speak at, I request a press pass on the basis of my work with a very large tech magazine where I’m the Senior Contributing Technical Editor. Weirdly, VM World organizers declined my request and I had to use credentials supplied by my talk organizer.

My interactions with the VMware people in charge of communications were a bit strange. These people weren’t exactly “the sharpest tools in the shed”, or, as another saying goes, they were “a few clowns short of a circus”. But I sensed just incompetence rather than malice.

I had a small sample size, but I was impressed by most of the attendees and speakers I talked to at VM World. As usual for me, the biggest benefit of speaking at a conference is the information I pick up in impromptu conversations. In particular, I can spot emerging trends in the ML/AI efforts of various companies, and use that information in my job to guide prioritization of efforts.



Amanda, Joan, and Krissy the clowns. I don’t like clowns. Especially female clowns. Ugh. Creepy and annoying. Well most of them.

Posted in Conferences