This week I was looking at logistic regression. Logistic regression is a statistical technique that can be used to predict a dependent variable that is binary, that is, a dependent variable that has only two possible values. The Wikipedia article on logistic regression gives an example with three independent variables - age (in years), sex (0 = male, 1 = female), and cholesterol (where 5.0 is about normal) - and a dependent variable to predict, death (0 = subject did not die, 1 = subject died). I coded up the Wikipedia example using C#. First I wrote a utility function to generate 100 lines of synthetic data:

static void MakeDummyDataFile(string resultFile)
{
  Random random = new Random(0);
  FileStream ofs = new FileStream(resultFile, FileMode.Create);
  StreamWriter sw = new StreamWriter(ofs);

  int age;            // actual age minus 50
  int sex;            // 0 = male, 1 = female
  double cholesterol; // actual cholesterol minus 5.0
  double z;
  double probDeath;
  int death;          // 1 = died, 0 = did not die

  double B0 = -5.0;
  double B1 = 2.0;
  double B2 = -1.0;
  double B3 = 1.2;

  for (int i = 0; i < 100; ++i) // 100 lines
  {
    age = random.Next(-40, 51); // upper bound is exclusive
    sex = random.Next(0, 2);
    double low = -4.0; double hi = 3.0;
    cholesterol = (hi - low) * random.NextDouble() + low;
    z = B0 + (B1 * age) + (B2 * sex) + (B3 * cholesterol);
    probDeath = 1.0 / (1.0 + Math.Exp(-1.0 * z));
    cholesterol = 5.0 + cholesterol; // shift back to the 5.0-centered scale for output
    death = (probDeath >= 0.50) ? 1 : 0;
    double p = random.NextDouble(); // flip roughly 20% of outcomes to add noise
    if (p < 0.20)
      death = (death == 0) ? 1 : 0;
    Console.WriteLine((50 + age) + " " + ((sex == 0) ? "M" : "F") +
      " " + cholesterol.ToString("F2") + " " + death);
    sw.WriteLine((50 + age) + " " + ((sex == 0) ? "M" : "F") +
      " " + cholesterol.ToString("F2") + " " + death);
  }
  sw.Close();
  ofs.Close();
} // MakeDummyDataFile

Here I set B0, B1, B2, and B3 to fixed, known values, compute the intermediate value z, and then pass z through the logistic (sigmoid) function 1 / (1 + e^(-z)) to get a number between 0.0 and 1.0 that represents the probability the subject died. The resulting data file looks like:

52 M 5.86 1

33 F 4.27 0

(etc.)
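The probability step can be sketched on its own. The coefficients below are the fixed values from the generator; the example subject (52 years old, male, cholesterol 5.86) is just illustrative:

```csharp
using System;

public class LogisticDemo
{
    // Same fixed coefficients as the data generator above. Inputs are the
    // relative values the generator works with (age - 50, sex flag,
    // cholesterol - 5.0).
    public static double ProbDeath(int ageRel, int sex, double cholRel)
    {
        double B0 = -5.0; double B1 = 2.0; double B2 = -1.0; double B3 = 1.2;
        double z = B0 + (B1 * ageRel) + (B2 * sex) + (B3 * cholRel);
        return 1.0 / (1.0 + Math.Exp(-z)); // logistic (sigmoid) function
    }

    public static void Main()
    {
        // Hypothetical subject: 52 years old (ageRel = 2), male (sex = 0),
        // cholesterol 5.86 (cholRel = 0.86).
        // z = -5.0 + 2.0*2 + 0 + 1.2*0.86 = 0.032, so p is just over 0.5
        // and the predicted outcome is death = 1.
        double p = ProbDeath(2, 0, 0.86);
        Console.WriteLine(p.ToString("F3")); // approximately 0.508
    }
}
```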

Once I had a synthetic data file, I wrote code to predict the data. In essence, the prediction problem is the reverse of the generation problem: I need to determine the (presumably unknown) values of B0, B1, B2, and B3 that best fit the data. This is a hard problem, and I'm looking at different approaches, including "iteratively reweighted least squares" (the standard approach) and particle swarm optimization (an artificial intelligence technique). Ultimately, doing this exercise has helped me really understand exactly what logistic regression is.
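Neither of those two approaches is shown here, but as a point of reference, a minimal sketch of perhaps the simplest alternative, per-example gradient ascent on the log-likelihood, illustrates what "determining B0 through B3" means concretely. The data set, learning rate, and epoch count below are made up for illustration:

```csharp
using System;

public class LogisticFitSketch
{
    // Per-example gradient ascent on the log-likelihood. Each x[i] is a row
    // of predictors with a leading 1.0 for the intercept; y[i] is 0 or 1.
    public static double[] Fit(double[][] x, int[] y, double rate, int epochs)
    {
        int n = x[0].Length;
        double[] b = new double[n]; // coefficient estimates, start at zero
        for (int ep = 0; ep < epochs; ++ep)
        {
            for (int i = 0; i < x.Length; ++i)
            {
                double z = 0.0;
                for (int j = 0; j < n; ++j) z += b[j] * x[i][j];
                double p = 1.0 / (1.0 + Math.Exp(-z));
                // The log-likelihood gradient for one case is (y - p) * x.
                for (int j = 0; j < n; ++j)
                    b[j] += rate * (y[i] - p) * x[i][j];
            }
        }
        return b;
    }

    public static void Main()
    {
        // Tiny made-up data set: one predictor plus an intercept; the
        // outcome tends to 1 when the predictor is positive.
        double[][] x = {
            new double[] { 1.0, -2.0 }, new double[] { 1.0, -1.0 },
            new double[] { 1.0,  1.0 }, new double[] { 1.0,  2.0 }
        };
        int[] y = { 0, 0, 1, 1 };
        double[] b = Fit(x, y, 0.1, 1000);
        Console.WriteLine(b[1] > 0); // slope estimate should come out positive
    }
}
```

The real methods named above converge faster and more reliably, but the objective being optimized is the same.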