The Logit Function

My background is in mathematics. One thing about math that I find simultaneously fascinating and confusing is that there are an insane number of conceptual connections between different ideas. Fully understanding these connections can take months or years of study.


An example is the logit function. On one hand it’s very simple, but on the other hand it’s related to dozens of other concepts that at first glance seem unrelated, logistic regression prediction in particular.

The easy part. If p is a number between 0 and 1 then:

logit(p) = ln( p / (1 - p) )

Because p / (1 - p) is the odds of something happening, the logit function is also called the log-odds function.

For example, suppose p = 0.6, then logit(p) = ln( 0.6 / 0.4 ) = ln(1.5) = 0.4055, where 1.5 is the odds.
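Here’s a minimal Python sketch of the function (not from the post, just a check of the arithmetic above):

```python
import math

def logit(p):
    # log-odds: natural log of p / (1 - p), for p strictly between 0 and 1
    return math.log(p / (1.0 - p))

print(logit(0.6))  # 0.4054651..., which rounds to 0.4055
```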

Now logistic regression is more complicated. Suppose you want to predict something that can take one of two values, such as ‘democrat’ or ‘republican’, based on features such as age (x1) and income (x2).

The logistic function is logistic(x1, x2) = 1 / ( 1 + e^-z ) where z = b0 + (b1)(x1) + (b2)(x2). The constants b0, b1, b2 must be determined using training data that has known input and output values.

As it turns out, the logit and logistic functions are mathematical inverses of each other: logit(logistic(z)) = z for any z, and logistic(logit(p)) = p for any p between 0 and 1.
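The inverse relationship is easy to verify numerically. This sketch uses the single-variable form logistic(z) = 1 / (1 + e^-z); the test values are arbitrary:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 1.234
p = 0.6
print(logit(logistic(z)))  # recovers 1.234 (to floating point precision)
print(logistic(logit(p)))  # recovers 0.6
```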

This conceptual connection between the logit and logistic functions means that logistic regression prediction can be performed directly, using the logistic function, or indirectly, using the logit function. The R language uses the indirect approach, which, for me, is more difficult to understand.

Posted in R Language

Rectified Linear Activation for Neural Networks

There has been a lot of recent research work done on deep neural networks. One result is that it’s now thought that using standard logistic sigmoid activation or tanh activation doesn’t work as well as rectified linear activation.

If you’re not familiar with neural networks this probably sounds like gibberish. I’ll try to explain. The key item in a neural network is called a hidden processing node. The value of a hidden node is computed by summing the products of the inputs into the node and their corresponding weights, adding a bias constant, and then taking the tanh() of that sum.


The tanh() is called an activation function. The tanh() function can accept any value from negative infinity to positive infinity, and returns a value between -1.0 and +1.0. An alternative to tanh() is called logistic sigmoid, abbreviated sigmoid(), which is similar but returns a value between 0.0 and +1.0.
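The hidden node computation can be sketched in a few lines of Python; the input, weight, and bias values here are made up for illustration:

```python
import math

inputs  = [1.0, 2.0, 3.0]      # made-up input values
weights = [0.10, 0.20, 0.15]   # made-up weights
bias    = 0.10                 # made-up bias constant

# sum of products of inputs and weights, plus the bias
z = sum(x * w for x, w in zip(inputs, weights)) + bias  # 1.05

print(math.tanh(z))                 # tanh activation, in (-1.0, +1.0)
print(1.0 / (1.0 + math.exp(-z)))   # logistic sigmoid activation, in (0.0, +1.0)
```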

When a neural network is trained, you need to use the calculus derivative of the activation function. Both derivatives are conveniently expressed in terms of the function’s output: if y = tanh(x), the derivative is (1 – y)(1 + y), and if y = sigmoid(x), the derivative is (y)(1 – y).

Rectified linear activation is so simple it’s confusing. In words, you return 0 if x is negative, or you return x if x is zero or positive. So for the example in the image above, the sum of products plus the bias is 1.05, so the final value after rectified linear activation is 1.05.

The calculus derivative is also almost too simple. If x is negative the derivative is 0. If x is positive, the derivative is 1. (At exactly x = 0 the derivative is undefined; in practice it’s usually just taken to be 0.)
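Both the activation and its derivative fit in a few lines of Python (a sketch, with the x = 0 case of the derivative taken as 0):

```python
def relu(x):
    # rectified linear: 0 for negative x, x itself otherwise
    return x if x > 0.0 else 0.0

def relu_deriv(x):
    # derivative: 0 for negative x, 1 for positive x (0 at x = 0 by convention)
    return 1.0 if x > 0.0 else 0.0

print(relu(1.05))         # 1.05
print(relu(-0.75))        # 0.0
print(relu_deriv(1.05))   # 1.0
print(relu_deriv(-0.75))  # 0.0
```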

It’s not 100% clear why rectified linear activation seems to work better than tanh() or sigmoid() for deep neural networks. For sure, rectified linear doesn’t suffer from what’s called the vanishing gradient problem. Rectified linear also produces sparse activations, where many hidden nodes emit exactly 0, which resembles a form of what’s called dropout.

Very complex but interesting topic.

Posted in Machine Learning

Another Look at R Tools for Visual Studio

Microsoft is making big investments in the R language. R is used mostly for data analysis, for example, performing a linear regression on some data.

The basic R language is very old. When you install Base R, you get a simple but effective program to run R commands and scripts, called the RGui.exe tool.

About eight weeks ago I took a first look at a preview version of R Tools for Visual Studio (RTVS). RTVS is an add-in for the very powerful Visual Studio programming environment.


I decided to see what was new with RTVS. First, instead of using Base R, I downloaded and installed the new Microsoft R Client which is really a wrapper around Microsoft R Open (MRO). MRO extends R by using multi-threaded math libraries for faster performance, and a special checkpoint package that manages R package dependencies.


After I installed MRO, I updated my existing Visual Studio 2015 to add the Update 3 package. RTVS only works with VS 2015 Update 2 or later.

So at this point I had MRO and VS 2015 Update 3 on my machine and I was ready to install RTVS 0.4. The install was essentially a VS update. It went smoothly and quickly.

After installing RTVS, I launched VS. VS automatically sensed I had MRO and gave me a little dialog box so I could tell VS to use MRO instead of the Base R on my machine. After VS launched, the File | New | Project option had a new R Project template. Very nice.

It’s still too early for me to form a solid opinion of RTVS, but I like what I’ve seen so far.

Posted in R Language

Machine Learning Scoring Rules

A scoring rule is a function that measures the accuracy of a set of predicted probabilities. For example, suppose you have a very oddly shaped four-sided die. You somehow predict the probabilities of the four sides:

(0.20, 0.10, 0.40, 0.30)

Then you actually roll the die several thousand times and find that the actual probabilities are:

(0.25, 0.15, 0.50, 0.10)

How good was your prediction? Instead of using a scoring rule, you could use an ordinary error measure, for example, mean squared error:

MSE = (0.05^2 + 0.05^2 + 0.10^2 + 0.20^2) / 4
    = (0.0025 + 0.0025 + 0.0100 + 0.0400) / 4
    = 0.01375

Error values are always non-negative, so smaller values indicate a better prediction.
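The MSE arithmetic above can be checked with a short Python snippet (not from the post):

```python
predicted = [0.20, 0.10, 0.40, 0.30]
actual    = [0.25, 0.15, 0.50, 0.10]

# mean of the squared differences between predicted and actual
mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
print(mse)  # 0.01375 (up to floating point rounding)
```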

But a more common approach is to use a scoring rule, which is sort of an indirect measure of error. Scoring rules are most often used in situations where exactly one of several possible outcomes occurs. For example, suppose you have a system that predicts the probabilities of who will win a political election between Adams, Baker, and Cogan:

p = (0.20, 0.50, 0.30)

Suppose the election is held and Cogan wins. The actual probabilities are:

x = (0.00, 0.00, 1.00)

The Logarithmic Scoring Rule is calculated like so:

LSR = (0.0)(ln(0.20)) + (0.0)(ln(0.50)) + (1.0)(ln(0.30))
    = ln(0.30)
    = -1.20

Notice that the calculation can be simplified to just “take the ln of the probability associated with the actual outcome.”

Suppose your prediction that Cogan would win was better:

(0.10, 0.20, 0.70)

Now the LSR = ln(0.70) = -0.36, which is greater (less negative) than the score for the first prediction. In short, LSR values are always negative (or zero for a perfect prediction), and larger (less negative) values indicate a better prediction.
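Here’s a small Python sketch of the logarithmic scoring rule calculation (the simplification noted above falls out because the actual vector is all 0s except for a single 1):

```python
import math

def log_score(predicted, actual):
    # sum of actual[i] * ln(predicted[i]); with a one-hot actual vector
    # this reduces to ln of the probability assigned to the actual outcome
    return sum(a * math.log(p) for p, a in zip(predicted, actual))

actual = [0.0, 0.0, 1.0]  # Cogan won the election
print(log_score([0.20, 0.50, 0.30], actual))  # ln(0.30) = -1.204...
print(log_score([0.10, 0.20, 0.70], actual))  # ln(0.70) = -0.357..., better
```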

Posted in Machine Learning

The SOLID Design Principles – Absolute Nonsense

One of the biggest pieces of nonsense in software engineering is the set of so-called SOLID principles. SOLID stands for SRP (single responsibility principle), OCP (open-closed principle), LSP (Liskov substitution principle), ISP (interface segregation principle), and DIP (dependency inversion principle).


Inexperienced developers are awed by the majesty of a cool acronym. However, virtually every senior developer I know sees SOLID as a joke perpetrated on college computer science students.

The problem is that SOLID just takes a few very general principles for good OOP software design and then slaps labels on them. Take for example the LSP which says, and I quote, “objects in a program should be replaceable with instances of their subtypes without altering the correctness of that program.” Brilliant! All this means is that types that derive from a parent type should derive from the parent type.
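For reference, here’s the standard textbook illustration of an LSP violation, sketched in Python with a made-up Bird example (not from any SOLID text):

```python
class Bird:
    def fly(self):
        return "flying"

class Penguin(Bird):
    def fly(self):
        # the subtype breaks the parent's contract, so code written
        # against Bird is no longer correct when handed a Penguin
        raise RuntimeError("penguins can't fly")

def release(bird):
    return bird.fly()  # correct for any well-behaved Bird

print(release(Bird()))  # works
# release(Penguin())    # raises an exception -- the substitution fails
```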

Or take the SRP, “a class should have only a single responsibility (i.e. only one potential change in the software’s specification should be able to affect the specification of the class).” OK, suppose you have a business specification that defines a Book type. There can be many reasons for a change. Suppose you change a date field format from American style month-day-year to European style. Or you change the way Book IDs are defined. By the SRP’s logic, almost any realistic class has multiple responsibilities.

Now, the underlying ideas behind SOLID are fine. But slavish devotion to a few general principles and the worship of acronyms is lame. In my world, if my colleagues and I asked a job applicant about SOLID and they parroted back definitions with a look of divine revelation on their faces, that applicant would not be rated highly. But an applicant who could demonstrate scenarios where basic OOP principles should be violated would get our respect.

Here’s my new acronym for good software design: the AWESOME principles. Software should be Awesomely recursive, Waterproof, Environmentally friendly, Slick, Open to inclusion, Magical and Extra special.

Posted in Miscellaneous

The Kelly Betting Criterion

The Kelly betting criterion is an interesting idea from probability. It’s best explained by example. Suppose you can make bets (or investments, in economic terms) that have a positive expectation, meaning your expected profit per bet is greater than zero; in this example, you’ll win more than half the time. Each bet is independent. How much of your bankroll should you bet each play in order to maximize your winnings?


If you bet too much, you run the risk of a streak of bad luck and you’d lose all your money. If you bet too little, you won’t be getting the maximum value out of each winning bet.

Suppose your probability of winning is p = 0.60, so your probability of losing is q = 0.40. Expressed as odds, your odds of winning are (0.60 / 0.40) to 1, which is 1.5 to 1. Now suppose your payoff, coincidentally, is also b = 1.5 to 1, meaning for every $1 bet, if you win you receive $1.50 in profit (pretty nice when you have a greater than 0.50 chance of winning!). Note that the b used below is the payoff odds, not the odds of winning.

The Kelly criterion says that in order to maximize the long-run growth of your bankroll you should bet (bp – q) / b percent of your bankroll each time.

Suppose you start with $1000 and p = 0.60 and q = 0.40 and b = 1.5 as above. You should bet a fraction

f = ( (1.5)(0.60) – 0.40 ) / 1.5
  = (0.90 – 0.40) / 1.5
  = 0.50 / 1.5
  = 0.33

So you’d bet $333.33 on your first bet, and in general 1/3 of your current bankroll every time. Because you always bet a fraction of whatever you currently have, you can never be wiped out completely, but a bad streak still hurts: losing three times in a row, which happens with probability (0.40)(0.40)(0.40) = 0.064, or 64 times in a thousand, would leave you with only (2/3)^3, about 30%, of your money. In practice, you’d be wise to be a bit conservative and bet somewhat less than the full Kelly fraction each play.
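The Kelly fraction calculation is a one-liner in Python (a sketch; p is the win probability and b is the payoff odds):

```python
def kelly_fraction(p, b):
    # fraction of bankroll to bet: (b*p - q) / b, where q = 1 - p
    q = 1.0 - p
    return (b * p - q) / b

f = kelly_fraction(0.60, 1.5)
print(f)           # 0.333..., i.e. bet 1/3 of the bankroll
print(0.40 ** 3)   # 0.064, the chance of losing three times in a row
```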

Posted in Machine Learning

A Neural Network in Python

I’ve been looking at creating neural networks using the Microsoft CNTK tool. CNTK is complicated. Very complicated, and a bit rough around the edges because it was developed as an internal tool rather than for public use.

In order to understand the architecture of CNTK I wanted to experiment with different input values, weight values, and bias values. So I decided to dust off an old Python implementation of a neural net, and refactor it from Python 2 to Python 3.

The refactoring was much easier than I thought — mostly changing V2 print statements to V3 print functions. Luckily I used V2 range() instead of xrange() so I didn’t have to worry about that.


Another reason I used Python to explore CNTK is that Python seems to be the utility language of choice for CNTK systems. For example, one of the tools that comes with CNTK is a Python script that downloads and formats the MNIST image recognition data set.

I don’t use Python all that often, but when I do use the language I like it. Python hits a sweet spot between simplicity and complexity.

My demo uses fake data that mirrors the famous Iris data set. Good fun.


Posted in Machine Learning