Research on Sports Statistics and Prediction

I’ve always been interested in machine learning predictions for sports. In particular, I enjoy creating prediction systems for National Football League games and American college (NCAA) basketball games.

But there doesn’t seem to be much in the way of traditional research journals or conferences aimed at the intersection of sports, mathematics, and computer science.

I did a brief search of the Internet and found three research journals, four relevant conferences (in the U.S. and Canada), and one resources site. Here they are:

1. Journal of Quantitative Analysis in Sports – The editor-in-chief is Mark Glickman, a statistician at Harvard. A typical article is “Estimating an NBA Player’s Impact on His Team’s Chances of Winning”, Vol. 12, Iss. 2, March 2016. See

2. Journal of Sports Analytics – The editor is Philip Maymin from the University of Bridgeport. A typical article is “Heterogeneity and Team Performance: Evaluating the Effect of Cultural Diversity in the World’s Top Soccer League”, 2016. See

3. International Journal of Computer Science in Sport – The editor is Arnold Baca, from the University of Vienna, Austria. A typical article is “A Rating System For Gaelic Football Teams: Factors That Influence Success”, December 2016. See

4. The MIT Sloan Sports Analytics Conference – This is by far the largest and most prominent conference. There were 4,000+ attendees in 2016. Sponsored by ESPN. Huge range of topics, for example, “Leveraging Digital Strategies and Analytics in Media and Sports”. Has a hybrid research-plus-sports-personalities feel. See

5. The New England Symposium on Statistics in Sports – Small academic event. In odd-numbered years. Sponsored by Harvard. Example talk is “Nearest-neighbor matchup effects: Predicting March Madness”, September 2015. See

6. The Cascadia Symposium on Statistics in Sports – Looks like it’s a small Canadian version of the MIT Sloan event. Example talk is “Meta-Analytics: Evaluating The Reliability of Player Metrics”, September 2016. See

7. Sports Analytics Innovation Summit – The sponsoring organization, “Innovation Enterprise”, puts on dozens of small conferences (maybe 100-300 attendees) every year, in many locations, on a wide range of topics. I’ve spoken at their “Big Data and Analytics” event several times and liked the event. Talks tend to aim at the business side of topics. See

8. American Statistical Association – This organization has a section on “Statistics in Sports” which has links to sources for statistical data. See

Posted in Conferences, Machine Learning, Top Ten | Leave a comment

Estimating a Polynomial Function using a Neural Network

One of my colleagues (Kirk O.) recently posed this challenge to me: create a neural network that can calculate the area of a triangle. I mentally scoffed. How hard could it be?

Well, several hours later, I had a working NN system that can calculate the area of a triangle, but it was more difficult than I expected. The area of a triangle is one-half the base times the height. The “one-half” is just a constant, so in reality the challenge is to compute f(x,y) = x * y or, more generally, to compute a polynomial.

I dove in and wrote some code, but I really had to experiment with the hyperparameters, especially the number of hidden nodes to use, the learning rate, and the momentum factor. In the end, I used a 2-5-1 network — two input nodes for the base and the height, five hidden processing nodes, and a single output node for the area.

I used tanh for the hidden layer activation function. I didn’t need an output layer activation function because this is a regression problem. I had to use a very small learning rate (0.0001) and a momentum factor = 0.0 to get good results.

There’s always a problem trying to determine model accuracy when dealing with regression. Here I cheated a bit by counting an area that’s within 0.5 of the correct result as a correct prediction. I also fudged by limiting the base and height to be less than 10.

The demo NN got 91.70% accuracy on its 1000-item training data set, and then 97.00% accuracy on a 100-item test set. It’s unusual to have better accuracy on the test data than on the training data but with NNs weird things happen.
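For anyone curious, here’s a minimal sketch of the kind of network described above, written with NumPy. This is my after-the-fact reconstruction, not the demo code itself (the momentum term is omitted for brevity, and the training details are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# 2-5-1 network: two inputs (base, height), five tanh hidden nodes,
# one identity output node (regression, so no output activation)
W1 = rng.normal(0.0, 0.1, (2, 5)); b1 = np.zeros(5)
W2 = rng.normal(0.0, 0.1, (5, 1)); b2 = np.zeros(1)

# training data: base and height limited to be less than 10
X = rng.uniform(0.0, 10.0, (1000, 2))
y = (0.5 * X[:, 0] * X[:, 1]).reshape(-1, 1)   # area = 0.5 * base * height

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

def mse():
    return float(np.mean((forward(X)[1] - y) ** 2))

loss_before = mse()
lr = 0.0001                  # very small learning rate, as in the post
for epoch in range(5000):
    h, out = forward(X)
    g_out = (out - y) / len(X)             # gradient of (half) mean squared error
    g_W2 = h.T @ g_out;  g_b2 = g_out.sum(axis=0)
    g_h = g_out @ W2.T * (1.0 - h * h)     # tanh derivative
    g_W1 = X.T @ g_h;    g_b1 = g_h.sum(axis=0)
    W2 -= lr * g_W2;  b2 -= lr * g_b2
    W1 -= lr * g_W1;  b1 -= lr * g_b1
loss_after = mse()

# count a prediction within 0.5 of the true area as "correct"
acc = float(np.mean(np.abs(forward(X)[1] - y) < 0.5))
print(loss_before, loss_after, acc)
```

Even this toy version shows the sensitivity: bump the learning rate up by a factor of 10 or 100 and training can easily go haywire.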

So, thanks Kirk, for making me toss a couple hours of my life away. No, actually, it was really a fun little problem and I learned some valuable tricks.

Posted in Machine Learning | 1 Comment

Recap of the 2017 Interop ITX Conference

I spoke at the 2017 Interop ITX Conference, which ran from May 15-19 in Las Vegas. See. I estimate there were about 3,000 attendees. Most of the attendees I spoke to were fairly senior managers at companies of all sizes, in all kinds of businesses.

My talk was titled “Understanding Deep Neural Networks”. I explained in some detail what regular NNs are, then described some variations of DNNs including convolutional NNs, recurrent NNs, LSTM NNs, and generative adversarial networks. I also mentioned a bit about some of the work Microsoft Research is doing in the area of deep learning. My talk was hosted by Sam Charrington, who is very knowledgeable, and who gave a great intro to the current state of ML. See

I’ve spoken at many Interop events over the years and each one was different from all the others. My overall impression is that most of the nuts-and-bolts challenges of enterprise networks seem to have been solved and that the two dominant issues are security and evolving infrastructure to give a competitive business advantage (such as predictive systems that use deep neural networks).

Other talks at the event that I particularly liked were “Machine, Platform, Crowd” (Andrew McAfee, MIT), “Machine Learning” (Josh Bloom, General Electric), and “AI for Wireless Networking” (Ajay Malik, Google).

Anyway, I met a lot of interesting people and picked up a lot of useful information, mostly about the current state of machine learning in enterprises. Every company I talked to is trying to find some way to get expertise in how to create advanced predictive systems.

The bottom line is that the 2017 Interop ITX conference was a good experience. I intend to speak there next year and if you work in IT management I recommend you consider attending too.

Posted in Conferences | Leave a comment

An Alternative to Softmax for Neural Network Classifier Activation

Suppose you are using a neural network to make a prediction where the thing-to-predict can be one of three possible values. For example, you might want to predict the political party affiliation of a person (Democrat, Republican, other) based on things like age, annual income, sex, and years of education.

A neural network classifier would accept four numeric inputs corresponding to age, income, sex, education and then generate a preliminary output of three values like (1.55, 2.30, 0.90) but then normalize the preliminary outputs so that they sum to 1.0 and can be interpreted as probabilities.

By far the most common normalizing function is called Softmax:

exp(1.55) = 4.71
exp(2.30) = 9.97
exp(0.90) = 2.46
sum = 17.15

softmax(1.55) = 4.71 / 17.15 = 0.27
softmax(2.30) = 9.97 / 17.15 = 0.58
softmax(0.90) = 2.46 / 17.15 = 0.14

If you are using the back-propagation algorithm for training, then you need the Calculus derivative of the Softmax function. If y is a Softmax output value, the derivative of y with respect to its corresponding input is y * (1 – y). (Strictly speaking there are also cross terms: the derivative of output i with respect to input j, when i != j, is -y_i * y_j.)

I’d always wondered if there were alternatives to the Softmax function. I tracked down a rather obscure research paper published in 2016 that explored something called the Taylor Softmax function. The Taylor Softmax for the example values above is:

taylor(1.55) = 1.0 + 1.55 + 0.5 * (1.55)^2
             = 3.75
taylor(2.30) = 1.0 + 2.30 + 0.5 * (2.30)^2
             = 5.95
taylor(0.90) = 1.0 + 0.90 + 0.5 * (0.90)^2
             = 2.31
sum = 12.00

taylor-soft(1.55) = 3.75 / 12.00 = 0.31
taylor-soft(2.30) = 5.95 / 12.00 = 0.50
taylor-soft(0.90) = 2.31 / 12.00 = 0.19
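Both normalizations are easy to reproduce. Here’s a short Python sketch (not the demo program from my experiment) that computes the two sets of values above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z)              # production code should subtract max(z) first for stability
    return e / e.sum()

def taylor_softmax(z):
    t = 1.0 + z + 0.5 * z * z  # second-order Taylor expansion of exp(z)
    return t / t.sum()

z = np.array([1.55, 2.30, 0.90])
print(np.round(softmax(z), 2))         # regular Softmax probabilities
print(np.round(taylor_softmax(z), 2))  # Taylor Softmax probabilities
```

One nice property of the Taylor version: 1 + z + 0.5 * z^2 is always positive (its minimum is 0.5, at z = -1), so the normalized values are always valid probabilities.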

The Calculus derivative of the Taylor Softmax is rather ugly. Working it out (this is my own derivation, since the original expression isn’t shown here, so check it before relying on it): if t_i = f(z_i) / S, where f(z) = 1 + z + 0.5 * z^2 and S is the sum of the f(z_j) values, then by the quotient rule the derivative of t_i with respect to z_j is (d_ij * (1 + z_i) * S – f(z_i) * (1 + z_j)) / S^2, where d_ij is 1 when i = j and 0 otherwise.

I coded up a demo program to compare regular Softmax with the Taylor Softmax. My non-definitive mini-exploration showed the regular Softmax worked much better.

My conclusion: Almost everything related to neural networks is a bit tricky. The Taylor Softmax activation function may be worth additional investigation, but my micro-research example leaves me a bit skeptical about the usefulness of Taylor Softmax.

Posted in Machine Learning | Leave a comment

Time Series Regression with a Neural Network

A time series regression problem is one where the goal is to predict a numeric value based on previous (in time) numeric values. For example, you might want to predict the closing price of a share of some company’s stock based on the closing prices on the previous three days.

There are many ways to tackle a time series regression problem. One basic approach that uses a neural network is best explained with a concrete example. Suppose you want to predict the value of the sine function. Here are some values:

x        sin(x)
0.0000	  0.0000
1.0000	  0.8415
2.0000	  0.9093
3.0000	  0.1411
4.0000	 -0.7568
5.0000	 -0.9589
6.0000	 -0.2794

You construct training data that looks like this:

0.0000,  0.8415,  0.9093,  0.1411
0.8415,  0.9093,  0.1411, -0.7568
0.9093,  0.1411, -0.7568, -0.9589
0.1411, -0.7568, -0.9589, -0.2794

The first three values in each line are the predictors, and the fourth value is the target to predict. It’s hard to explain in words, but if you examine the data you’ll see what’s going on. Notice that data points get duplicated in the training data file. You can avoid this, but in my opinion it’s better to keep things simple by using duplicate data points.
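The rolling-window construction is easy to automate. Here’s a short Python sketch (a sketch of the idea, not production code) that generates the exact training items shown above:

```python
import math

# the sine series from the table above, rounded to four decimals
series = [round(math.sin(x), 4) for x in range(7)]

def make_windows(data, n_lags):
    # each training item: n_lags consecutive values as predictors,
    # followed by the next value in the series as the target
    return [data[i:i + n_lags] + [data[i + n_lags]]
            for i in range(len(data) - n_lags)]

for row in make_windows(series, 3):
    print(row)
```

Printing the rows reproduces the four training lines above; change n_lags to experiment with a different look-back length.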

At this point you have a standard prediction problem and you can use a basic neural network.

The example here looks back in time three data points. Why three and not four? Basically, the number of time steps to look back is a free parameter and must be determined by trial and error.

The moral of the story is that analyzing a time series regression problem with a neural network is really mostly about converting data into a usable format. By the way, a related approach for time series regression is to use a recurrent neural network.

Posted in Machine Learning | Leave a comment

Deep Neural Network Training Batch vs. Online

I’ve been getting my butt kicked, technically speaking, for the past couple of days. I’ve been exploring training deep neural networks. When I use standard online training with back-propagation, my code seems to work fairly well.

My code creates 2,000 dummy items. Each item has four inputs and three outputs and looks like (4.5, -3.2, 1.6, -2.0, 0, 0, 1). The generator uses a 4-(10,10,10)-3 deep NN — four inputs, three hidden layers of ten nodes each, and three outputs. Therefore, the generator has (4 * 10) + (10 * 10) + (10 * 10) + (10 * 3) + 30 + 3 = 303 weights and biases that must be determined.
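As a quick sanity check on that arithmetic, the parameter count for any fully connected architecture can be computed in a couple of lines:

```python
# parameter count for a 4-(10,10,10)-3 fully connected network
layers = [4, 10, 10, 10, 3]
n_weights = sum(a * b for a, b in zip(layers, layers[1:]))  # 40 + 100 + 100 + 30
n_biases = sum(layers[1:])                                  # 10 + 10 + 10 + 3
print(n_weights + n_biases)  # 303
```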

One of the points of my investigation is to explore the vanishing gradient phenomenon. In the image above I display one gradient every 200 training epochs and you can see that, as expected, it quickly goes to nearly 0 (to four decimals).

So, just for fun I thought I’d see what the effect of using batch training would be.

What the heck?! The NN just doesn’t learn at all. Now, I know that online training generally works better than batch training, but this result is extreme. I suspect I may have a bug in my batch-training code. But tracking down a problem in code like this could easily take days, so I’m going to have to put it aside for now. Grrr.
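For what it’s worth, batch training itself shouldn’t fail this badly, which is why I suspect a bug. Here’s a toy contrast (a one-weight linear model, nothing like my deep NN code) where both update schemes happily converge:

```python
# fit w in y = w * x with online vs. batch gradient descent; true w is 3
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
lr = 0.01

w_online = 0.0
for epoch in range(200):
    for x, y in data:                    # online: update after every item
        w_online -= lr * (w_online * x - y) * x

w_batch = 0.0
for epoch in range(200):                 # batch: one update per epoch,
    g = sum((w_batch * x - y) * x        # averaged over all items
            for x, y in data) / len(data)
    w_batch -= lr * g

print(w_online, w_batch)
```

With a reasonable learning rate both end up at w = 3; batch just takes smoother, less frequent steps.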

Posted in Machine Learning | 2 Comments

Buster Blackjack

Most of the technical conferences I speak at are in Las Vegas. One of the things I love about Las Vegas is the constant innovation — every trip I see new restaurants, new kinds of entertainment, and new games.

When I was young I loved all kinds of card games, and I’m still fascinated by them. On a recent trip I saw a new variant of Blackjack called Buster Blackjack. The game is basically regular Blackjack, but on each hand you can bet $5 on an optional side bet. The side bet is independent (so you can win the buster bet even if you bust yourself) and you win if the dealer busts (goes over 21).

If the dealer busts, the payout depends on how many cards are in her busted hand. For example, if she has a King and a Five, then draws an Eight (a three-card bust), you win 1 to 1 ($5). The payouts at the MGM Grand hotel and casino where I was are:

8+ cards  250 to 1
7 cards    50 to 1
6 cards    20 to 1
5 cards     8 to 1
4 cards     2 to 1
3 cards     1 to 1

There is some fascinating math here. If you are counting cards, for regular Blackjack you want the deck to hold mostly high cards (8, 9, 10, Ace) and very few low cards (2, 3, 4, 5) but if you’re betting that the dealer will bust, you want the deck to have low cards so that a busted hand will have more cards in it and give you a higher payout.

The only way you could analyze a strategy would be to write a computer simulation.
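To illustrate, here’s a rough simulation of just the dealer’s hand. It makes big simplifying assumptions (cards drawn from an infinite deck, dealer stands on all 17s), so the numbers at a real table, with hit-soft-17 and shoe-penetration effects, would differ somewhat:

```python
import random

RANKS = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11]  # J, Q, K = 10; Ace = 11

def dealer_hand(rng):
    # dealer draws to 17 and stands on all 17s; an Ace drops
    # from 11 to 1 when needed (infinite-deck assumption)
    total, aces, cards = 0, 0, 0
    while total < 17:
        c = rng.choice(RANKS)
        total += c
        cards += 1
        if c == 11:
            aces += 1
        while total > 21 and aces > 0:
            total -= 10
            aces -= 1
    return total, cards

rng = random.Random(1)
n, bust_counts = 100_000, {}
for _ in range(n):
    total, cards = dealer_hand(rng)
    if total > 21:
        bust_counts[cards] = bust_counts.get(cards, 0) + 1

bust_rate = sum(bust_counts.values()) / n
print(f"dealer bust rate: {bust_rate:.3f}")
for k in sorted(bust_counts):
    print(f"{k}-card busts: {bust_counts[k] / n:.4f}")
```

Extending this to estimate the side bet’s expected value just means multiplying each bust-card-count frequency by the corresponding payout and subtracting the losing cases.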

I didn’t play Buster Blackjack but maybe I’ll give the game a try after I do some analysis.

Posted in Machine Learning, Miscellaneous | Leave a comment