I was looking at a machine learning technique called probit (“probability unit”) classification. Probit classification is exactly like logistic regression classification except that where LR uses the logistic sigmoid function to compute output, probit uses the cumulative distribution function (CDF) of the standard Gaussian (Normal) distribution. The CDF is the area under the Gaussian bell-shaped curve from -infinity to some value z.
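Written out as equations, the two output functions look like the following. The symbols σ for the logistic sigmoid and Φ for the Gaussian CDF are my notation, not anything from a particular library, and I'm assuming the standard Gaussian (mean 0, standard deviation 1):

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}
\qquad\qquad
\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \, e^{-u^{2}/2} \, du
```

Both map any real z to a value between 0 and 1, which is why either can be interpreted as a probability.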

So, to write probit classification code, I needed to implement a CDF method using C#. Unlike the logistic sigmoid function, which is just 1.0 / (1.0 + Math.Exp(-z)), the CDF has no closed-form expression in terms of elementary functions. The most common approach is to use one of many close approximations. I used equation 7.1.26 from a famous reference, the “Handbook of Mathematical Functions” (1965), by Abramowitz and Stegun. The book is often just called “A&S”.

Strictly speaking, equation 7.1.26 approximates something called the “error function” (often called “erf”), but the CDF is just a slight modification of erf. Anyway, here’s one way to code CDF using C#:

static double CumDensity(double z)
{
  // coefficients for A&S equation 7.1.26
  double p = 0.3275911;
  double a1 = 0.254829592;
  double a2 = -0.284496736;
  double a3 = 1.421413741;
  double a4 = -1.453152027;
  double a5 = 1.061405429;

  // the approximation is for x >= 0, so remember the sign of z
  int sign;
  if (z < 0.0)
    sign = -1;
  else
    sign = 1;

  double x = Math.Abs(z) / Math.Sqrt(2.0);
  double t = 1.0 / (1.0 + p * x);
  double erf = 1.0 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.Exp(-x * x);

  // convert erf to CDF
  return 0.5 * (1.0 + sign * erf);
}
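For reference, here is the math that the code above follows, as I understand it. I'm writing Φ for the Gaussian CDF (my notation); p and a1 through a5 are the constants in the code:

```latex
\Phi(z) = \tfrac{1}{2}\left[\, 1 + \operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right) \right]

\operatorname{erf}(x) \approx 1 - \left(a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5\right) e^{-x^2},
\qquad t = \frac{1}{1 + p\,x}, \quad x \ge 0

\operatorname{erf}(-x) = -\operatorname{erf}(x)
```

The approximation is only stated for non-negative x, which is why the code works with the absolute value of z and restores the sign at the end. A&S gives the maximum absolute error of the erf approximation as 1.5e-7.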

I wrote a little harness to spit out values of CDF for z values from -4.0 to +4.0 (every 0.25), and then copy-pasted the results into Excel and made a graph. I also put values for the logistic sigmoid on the graph to see how they compare.
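A harness along these lines might look something like the sketch below. This is a reconstruction for illustration, not the exact program: the class name CdfHarness, the helper LogSig, and the comma-separated output format are all assumptions. CumDensity is the same A&S 7.1.26 approximation as above, repeated so the program stands alone.

```csharp
using System;
using System.Globalization;

public static class CdfHarness
{
    // Logistic sigmoid, for comparison (helper name is an assumption).
    public static double LogSig(double z)
    {
        return 1.0 / (1.0 + Math.Exp(-z));
    }

    // Gaussian CDF via the A&S 7.1.26 approximation of erf.
    public static double CumDensity(double z)
    {
        double p = 0.3275911;
        double a1 = 0.254829592;
        double a2 = -0.284496736;
        double a3 = 1.421413741;
        double a4 = -1.453152027;
        double a5 = 1.061405429;

        int sign = (z < 0.0) ? -1 : 1;
        double x = Math.Abs(z) / Math.Sqrt(2.0);
        double t = 1.0 / (1.0 + p * x);
        double erf = 1.0 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.Exp(-x * x);
        return 0.5 * (1.0 + sign * erf);
    }

    public static void Main()
    {
        // Header row so the output pastes cleanly into a spreadsheet.
        Console.WriteLine("z,cdf,logsig");

        // 33 points: z = -4.00, -3.75, ..., +4.00.
        // An integer counter avoids accumulating floating point error in z.
        for (int i = 0; i <= 32; ++i)
        {
            double z = -4.0 + 0.25 * i;
            Console.WriteLine(string.Format(CultureInfo.InvariantCulture,
                "{0:F2},{1:F6},{2:F6}", z, CumDensity(z), LogSig(z)));
        }
    }
}
```

The invariant culture is used so that the decimal separator is always a period, which keeps the comma-separated columns unambiguous regardless of the machine's regional settings.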

As the graph shows, both functions range from 0.0 to 1.0 and have a similar S shape. So, it isn’t a surprise that logistic regression classification and probit classification give pretty much the same results. As it turns out, logistic regression is easier to work with, in part because the derivative of the logistic sigmoid has a very simple form, which simplifies model training using calculus-based techniques. Different fields tend to use one or the other classification technique. For example, in economics, probit seems to be more common, but in other fields, logistic regression seems more common.
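To make the point about derivatives a bit more concrete, here is a small sketch. The derivative of the logistic sigmoid can be computed from the sigmoid's own output, s * (1 - s), while the derivative of the Gaussian CDF is the Gaussian density function, which needs a separate calculation. The class and method names here are made up for illustration.

```csharp
using System;

public static class DerivativeSketch
{
    public static double LogSig(double z)
    {
        return 1.0 / (1.0 + Math.Exp(-z));
    }

    // Derivative of the logistic sigmoid, written in terms of its own output.
    public static double LogSigDeriv(double z)
    {
        double s = LogSig(z);
        return s * (1.0 - s);
    }

    // Derivative of the standard Gaussian CDF: the standard Gaussian density.
    public static double CumDensityDeriv(double z)
    {
        return Math.Exp(-0.5 * z * z) / Math.Sqrt(2.0 * Math.PI);
    }

    public static void Main()
    {
        // Print both derivatives at a few z values for comparison.
        for (int i = -2; i <= 2; ++i)
        {
            double z = i;
            Console.WriteLine("z = " + z.ToString("F1") +
                "  logistic deriv = " + LogSigDeriv(z).ToString("F6") +
                "  Gaussian CDF deriv = " + CumDensityDeriv(z).ToString("F6"));
        }
    }
}
```

In a training loop the sigmoid output is usually already available from the forward computation, so its derivative is nearly free. For probit, the density has to be evaluated in addition to the approximated CDF.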