Machine Learning Scoring Rules

A scoring rule is a function that measures the accuracy of a set of predicted probabilities. For example, suppose you have a very oddly shaped die with four sides. You somehow predict the probabilities of the four sides:

(0.20, 0.10, 0.40, 0.30)

Then you actually roll the die several thousand times and find that the actual probabilities are:

(0.25, 0.15, 0.50, 0.10)

How good was your prediction? Instead of using a scoring rule, you could use an ordinary error metric such as mean squared error:

MSE = (0.05^2 + 0.05^2 + 0.10^2 + 0.20^2) / 4
    = (0.0025 + 0.0025 + 0.01 + 0.04) / 4
    = 0.055 / 4
    = 0.01375

Error values are always non-negative, so smaller values indicate a better prediction.
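For reference, here is a minimal Python version of the same calculation, using the predicted and observed probabilities above:

predicted = [0.20, 0.10, 0.40, 0.30]
actual    = [0.25, 0.15, 0.50, 0.10]

mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
print(mse)  # approximately 0.01375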

But a more common approach is to use a scoring rule, which is sort of an indirect measure of error. Scoring rules are most often used in situations where exactly one of several possible outcomes occurs. For example, suppose you have a system that predicts the probabilities of who will win a political election between Adams, Baker, and Cogan:

p = (0.20, 0.50, 0.30)

Suppose the election is held and Cogan wins. The actual probabilities are:

x = (0.00, 0.00, 1.00)

The Logarithmic Scoring Rule is calculated like so:

LSR = (0.00)(ln(0.20)) + (0.00)(ln(0.50)) + (1.00)(ln(0.30))
    = ln(0.30)
    = -1.20

Notice that the calculation can be simplified to just “take the ln of the probability associated with the actual outcome.”

Suppose your prediction had given Cogan a better chance of winning:

(0.10, 0.20, 0.70)

Now the LSR = ln(0.70) = -0.36, which is greater (less negative) than the LSR for the first prediction. In short, LSR values are never positive, and larger (less negative) values indicate a better prediction.
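Here is a minimal Python sketch of the simplified calculation, assuming the index of the winning candidate is known:

import math

def log_score(pred_probs, winner_idx):
    # "take the ln of the probability associated with the actual outcome"
    return math.log(pred_probs[winner_idx])

print(log_score([0.20, 0.50, 0.30], 2))  # approximately -1.20 (Cogan wins)
print(log_score([0.10, 0.20, 0.70], 2))  # approximately -0.36 (the better prediction)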

Posted in Machine Learning

The SOLID Design Principles – Absolute Nonsense

One of the biggest pieces of nonsense in software engineering is the set of so-called SOLID principles. SOLID stands for the single responsibility principle (SRP), the open-closed principle (OCP), the Liskov substitution principle (LSP), the interface segregation principle (ISP), and the dependency inversion principle (DIP).

[Image: WorshipMe]

Inexperienced developers are awed by the majesty of a cool acronym. However, virtually every senior developer I know sees SOLID as a joke perpetrated on college computer science students.

The problem is that SOLID just takes a few very general principles for good OOP software design and then slaps labels on them. Take, for example, the LSP, which says, and I quote, “objects in a program should be replaceable with instances of their subtypes without altering the correctness of that program.” Brilliant! All this means is that types that derive from a parent type should derive from the parent type.

Or take the SRP: “a class should have only a single responsibility (i.e. only one potential change in the software’s specification should be able to affect the specification of the class).” OK, suppose you have a business specification that defines a Book type. There can be many reasons for a change. Suppose you change a date field format from American style month-day-year to European style day-month-year. Or you change the way Book IDs are defined. Which of those counts as the class’s one responsibility? The principle gives no practical guidance.

Now, the underlying ideas behind SOLID are fine. But slavish devotion to a few general principles and the worship of acronyms is lame. In my world, if my colleagues and I asked a job applicant about SOLID and they parroted back definitions with a look of divine revelation on their faces, that applicant would not be rated highly. But an applicant who could demonstrate scenarios where basic OOP principles should be violated would get our respect.

Here’s my new acronym for good software design: the AWESOME principles. Software should be Awesomely recursive, Waterproof, Environmentally friendly, Slick, Open to inclusion, Magical and Extra special.

Posted in Miscellaneous

The Kelly Betting Criterion

The Kelly betting criterion is an interesting idea from probability. It’s best explained by example. Suppose you can make bets (or investments in economic terms) that have a positive expectation, meaning that on average you come out ahead. Each bet is independent. How much of your bankroll should you bet each play in order to maximize your winnings?

[Image: Kelly criterion equation]

If you bet too much, you run the risk of a streak of bad luck and you’d lose all your money. If you bet too little, you won’t be getting the maximum value out of each winning bet.

Suppose your probability of winning is p = 0.60, so your probability of losing is q = 0.40. Expressed as odds, your odds of winning are (0.60 / 0.40) to 1, which is 1.5 to 1. Now suppose your payoff odds, coincidentally, are also b = 1.5 to 1, meaning for every $1 bet, if you win you gain $1.50 (pretty nice when you have a greater than 0.50 chance of winning!)

The Kelly criterion says that in order to maximize your profit you should bet a fraction f = (bp – q) / b of your bankroll each time, where b is the payoff odds.

Suppose you start with $1000 and p = 0.60 and q = 0.40 and b = 1.5 as above. You should bet a fraction

f = ((1.5)(0.60) – 0.40) / 1.5
  = (0.90 – 0.40) / 1.5
  = 0.50 / 1.5
  = 0.33

So you’d bet $333.33 on your first bet, and in general 1/3 of your current bankroll every time. If instead you bet a fixed $333.33 every time, losing three times in a row would wipe you out, and the chances of that happening are (0.40)(0.40)(0.40) = 0.064, or 64 times in a thousand. Betting a fixed fraction of your current bankroll can never take you all the way to zero, but a bad streak still hurts, so in practice you’d be wise to be a bit conservative and bet less than the full Kelly fraction each play.
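Here is a minimal Python sketch of the calculation, plus a small simulation using the same assumed p, q, and b values (the 100-bet simulation is just an illustration, not part of the original example):

import random

def kelly_fraction(p, b):
    # p = probability of winning, b = payoff odds ("b to 1")
    q = 1.0 - p
    return (b * p - q) / b

p, b = 0.60, 1.5
f = kelly_fraction(p, b)
print(f)  # approximately 0.3333

# simulate 100 bets, starting with $1000 and betting the Kelly fraction each time
random.seed(0)
bankroll = 1000.0
for _ in range(100):
    stake = f * bankroll
    if random.random() < p:
        bankroll += b * stake   # win: gain 1.5 times the stake
    else:
        bankroll -= stake       # lose: forfeit the stake
print(round(bankroll, 2))       # final bankroll after 100 simulated bets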

Posted in Machine Learning

A Neural Network in Python

I’ve been looking at creating neural networks using the Microsoft CNTK tool. CNTK is complicated. Very complicated, and a bit rough around the edges because it was developed as an internal tool rather than for public use.

In order to understand the architecture of CNTK I wanted to experiment with different input values, weight values, and bias values. So I decided to dust off an old Python implementation of a neural net, and refactor it from Python 2 to Python 3.

The refactoring was much easier than I thought — mostly changing V2 print statements to V3 print functions. Luckily I used V2 range() instead of xrange() so I didn’t have to worry about that.
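For example (a made-up line, not from the actual refactoring), a Python 2 print statement and its Python 3 print-function equivalent look like this:

# Python 2 style print statement (a syntax error in Python 3):
#   print "epoch =", epoch, "error =", error
# Python 3 print function equivalent:
epoch, error = 10, 0.0456   # hypothetical values for illustration
print("epoch =", epoch, "error =", error)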

[Image: neural network with Python demo run]

Another reason I used Python to explore CNTK is that Python seems to be the utility language of choice for CNTK systems. For example, one of the tools that comes with CNTK is a Python script that downloads and formats the MNIST image recognition data set.

I don’t use Python all that often, but when I do use the language I like it. Python hits a sweet spot between simplicity and complexity.

My demo uses fake data that mirrors the famous Iris data set. Good fun.
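To give a sense of the kind of computation involved, here is a minimal sketch (not the actual demo code) of a single forward pass through a 4-5-3 network with tanh hidden activation and softmax output. The 4 inputs and 3 outputs match the Iris data; the 5-node hidden layer, the weight values, and the NumPy implementation are just assumptions for illustration:

import numpy as np

def forward(x, w_ih, b_h, w_ho, b_o):
    h = np.tanh(np.dot(x, w_ih) + b_h)   # hidden layer activations
    z = np.dot(h, w_ho) + b_o            # output layer pre-activations
    e = np.exp(z - np.max(z))            # numerically stable softmax
    return e / np.sum(e)

rng = np.random.default_rng(0)
x = np.array([5.1, 3.5, 1.4, 0.2])       # one fake Iris-like input item
w_ih = rng.normal(0, 0.1, (4, 5))        # input-to-hidden weights
b_h = np.zeros(5)                        # hidden biases
w_ho = rng.normal(0, 0.1, (5, 3))        # hidden-to-output weights
b_o = np.zeros(3)                        # output biases

print(forward(x, w_ih, b_h, w_ho, b_o))  # three pseudo-probabilities that sum to 1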

[Image: neural network with Python code]

Posted in Machine Learning

An Interview for the 2016 DevConnections Conference

I will be speaking at the 2016 DevConnections conference, October 10-13, at the Aria Hotel in Las Vegas. The DevConnections people did a short interview with me and published it in Windows IT Pro magazine. In the interview I describe what my two talks will be about. See:
http://windowsitpro.com/connections/itdev-connections-2016-speaker-highlight-james-mccaffrey.

In my opinion, the DevConnections conference (also called IT/DevConnections) is one of the top three conferences for software developers and IT engineers who use Microsoft technologies. This will be my 12th year at DevConnections and I wouldn’t go if I didn’t think the event had great value.

[Image: interview web page for IT/Dev Connections 2016]

There’s a lot to be learned at DevConnections, but the reality is that you can learn a ton by using the Internet. However, attending an event like DevConnections in person has huge benefits. For example, there’s plenty of research showing that conference attendees return to their workplace with renewed energy, enthusiasm, motivation, and increased attention. This has to translate into improved productivity.

Now to be sure, conferences are pricey. But the DevConnections Web site has links to information you can use to help convince your boss to send you. See the Web site at: http://www.itdevconnections.com/dc16/Public/Enter.aspx.

You can get a $500 registration discount using the code 500SPKR.

Come visit me in Las Vegas!

Posted in Conferences

Granger Causality and Nicolas Cage Movies

One of the things I like best about where I work is that there are a lot of smart people. Really, really smart people. I was talking to John K today and he told me about research he’s working on that involves causality — does one thing cause another or not.

Proving that one thing causes another is often difficult. For example, the annual number of people who drowned by falling into a swimming pool from 1999 to 2009 correlates closely with the number of films that actor Nicolas Cage appeared in each year. But correlation does not mean causation; otherwise Cage could decrease pool drownings by retiring.

[Image: swimming pool drownings vs. Nicolas Cage films]

There’s an interesting notion called Granger Causality. That’s when one time series can be used to make predictions of a second time series that are better than predictions made using the second time series alone.

For example, in the graphs below, beer sales spike on Days 1, 6, and 10. That information could be used to improve the predictions for pizza sales.

[Image: Granger causality graphs of beer and pizza sales]

Granger causality is really more of a statistical thing than cause as in “cause and effect”. Causality is a pretty deep concept.
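Here is a small sketch of how a Granger causality test might look in Python using the statsmodels library, with made-up beer/pizza sales data (the two-day lag, the sample size, and all the numbers are assumptions for illustration):

import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
n = 200
beer = rng.normal(100, 10, n)                          # fake daily beer sales
pizza = 0.5 * np.roll(beer, 2) + rng.normal(0, 5, n)   # pizza roughly follows beer, two days later

# column order matters: this tests whether the second column (beer)
# helps predict the first column (pizza)
data = np.column_stack([pizza, beer])
results = grangercausalitytests(data, maxlag=3)
print(results[2][0]['ssr_ftest'][1])  # p-value of the F test at lag 2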

Posted in Machine Learning

My Talk About Prediction Markets

I gave a talk about Prediction Markets at Microsoft Research. In a Prediction Market, experts buy and sell shares of future outcomes and get money if their prediction is correct. For example, suppose you have 12 experts and you want to predict the outcome of a political election between candidates Adams, Baker, and Cogan.

[Image: Alex introducing James]

Experts buy and sell shares of the three candidates. After each purchase or sale, the prices of shares of the candidates go up or down. After the election is held, experts get $1 for every share they hold of the winning candidate.

Just before the election is held, you can determine the probability of each of the three candidates winning from the number of outstanding shares of each candidate.

[Image: middle of the talk]

In my talk I explained the math equations used to determine the prices of shares, and the probabilities of outcomes.
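The post doesn’t reproduce those equations, but one commonly used prediction-market mechanism is Hanson’s logarithmic market scoring rule (LMSR); here is a small sketch of it under an assumed liquidity parameter b and hypothetical share counts (the talk’s actual equations may differ):

import math

def lmsr_cost(shares, b):
    # LMSR cost function over the outstanding shares of each outcome
    return b * math.log(sum(math.exp(q / b) for q in shares))

def lmsr_prices(shares, b):
    # instantaneous prices; they sum to 1.0 and can be read as
    # the market's estimated winning probabilities
    denom = sum(math.exp(q / b) for q in shares)
    return [math.exp(q / b) / denom for q in shares]

b = 100.0                     # liquidity parameter (an assumption)
shares = [20.0, 50.0, 30.0]   # hypothetical outstanding shares of Adams, Baker, Cogan

print(lmsr_prices(shares, b))                  # implied winning probabilities
cost_now = lmsr_cost(shares, b)
cost_after = lmsr_cost([20.0, 50.0, 40.0], b)  # someone buys 10 shares of Cogan
print(cost_after - cost_now)                   # price paid for those 10 shares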

There was a lot more interest in the talk than I thought there’d be. The lecture room was full and a couple of hundred people watched a streaming broadcast of the talk too.

Posted in Conferences, Machine Learning