Logistic regression (LR) is a machine learning technique that makes predictions in situations where the outcome can be one of two possible values. For example, you might want to predict whether a person died or not (Y), based on age, sex, and cholesterol level.

The equation for a LR prediction for this example with three x-values is p = 1 / (1 + exp(-z)) where z = b0 + (b1)(x1) + (b2)(x2) + (b3)(x3). The result is a probability between 0.0 and 1.0. If p is less than 0.5 you predict 0 (didn’t die), but if p is greater than 0.5 you predict 1 (did die).

Suppose b0 = 1.00, b1 = 0.01, b2 = 0.01, b3 = 0.01 and x1 = 48 (age), x2 = 1 (female), x3 = 4.40 (cholesterol). Then z = (1.00) + (0.01)(48) + (0.01)(1) + (0.01)(4.40) = 1.534 and p = 1 / (1 + exp(-1.534)) = 0.8226, so the prediction is 1 (died).
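The arithmetic above can be checked with a few lines of Python. This is just a minimal sketch; the function name predict_probability and the variable names are my own choices for illustration.

```python
import math

def predict_probability(weights, x_values):
    """Return p = 1 / (1 + exp(-z)), where z = b0 + b1*x1 + ... + bn*xn."""
    bias, coefficients = weights[0], weights[1:]
    z = bias + sum(b * x for b, x in zip(coefficients, x_values))
    return 1.0 / (1.0 + math.exp(-z))

weights = [1.00, 0.01, 0.01, 0.01]   # b0, b1, b2, b3
person = [48, 1, 4.40]               # age, sex (1 = female), cholesterol

p = predict_probability(weights, person)
prediction = 1 if p > 0.5 else 0     # 1 = died, 0 = didn't die
print(f"p = {p:.4f}, prediction = {prediction}")
```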

The process of finding the b-values (usually called the weights) is called training the model. You take known data and use an optimization algorithm to find the weight values that produce predictions that best match the known correct Y values.

Training is difficult. There are many optimization algorithms you can use, but two of the most common are the stochastic gradient descent algorithm and the Newton-Raphson algorithm. Both algorithms are iterative: each iteration computes a new set of b-values.
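To give a feel for what "iterative" means, here is a rough sketch of a single stochastic gradient descent update for one training item, using the standard rule b = b + (learning rate)(y - p)(x). The function name sgd_update, the learning rate of 0.01, and the assumption that this person's known outcome is 0 are all mine, picked only for illustration.

```python
import math

def sgd_update(weights, x_row, y, learning_rate):
    """One stochastic gradient descent step for a single training item.

    x_row includes a leading 1.0, so weights[0] plays the role of b0.
    """
    z = sum(b * x for b, x in zip(weights, x_row))
    p = 1.0 / (1.0 + math.exp(-z))
    # Nudge each weight in the direction that reduces the error for this item.
    return [b + learning_rate * (y - p) * x for b, x in zip(weights, x_row)]

weights = [1.00, 0.01, 0.01, 0.01]
x_row = [1.0, 48, 1, 4.40]   # leading 1.0, age, sex, cholesterol
new_weights = sgd_update(weights, x_row, y=0, learning_rate=0.01)
print(new_weights)
```

A full training run would repeat this update many times over all the training items, usually in a shuffled order.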

The equation for Newton-Raphson for LR is intimidating at first glance, but is simple once you understand it:

b_new = b_old + inv(X^T W X) X^T (Y - p)

Here b is a 1-column matrix of the b-values. The X matrix is a so-called design matrix: the input values with a leading column of 1.0s added. The W matrix holds (p)(1-p) values on the diagonal, 0.0s elsewhere. Capital T indicates matrix transposition. The inv() indicates matrix inversion. The Y matrix is the known, correct output 0 or 1 values. The p matrix holds the calculated probabilities.

I did a single iteration using R. My X design matrix and Y matrix of known values were:

    X                        Y
    --------------------------
    1.0   48   1   4.40      0
    1.0   60   0   7.89      1
    1.0   51   0   3.48      0
    1.0   66   0   8.41      1
    1.0   40   1   3.05      0

My initial b-values were (1.00, 0.01, 0.01, 0.01). Using the X and the b-values, the calculated p-values are (0.8226, 0.8428, 0.8242, 0.8512, 0.8085), which means the initial predictions are all 1 (died). Using just one iteration of Newton-Raphson, the new b-values are (5.12, -0.34, -2.82, 2.36), which gives new p-values of (0.02, 0.96, 0.02, 0.92, 0.02). These are very, very close to the correct Y-values of (0, 1, 0, 1, 0). After a few more iterations, the new b-values would give a very good predictive model.
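I did the iteration in R, but the same single step can be sketched in plain Python. This is not my original R code, just an illustration of the update b + inv(X^T W X) X^T (Y - p). All function names here (solve, probabilities, newton_raphson_step) are made up for the sketch. One deliberate difference: instead of forming the inverse explicitly, the sketch solves the equivalent linear system with Gaussian elimination, which gives the same result.

```python
import math

def solve(a, rhs):
    """Solve a * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x

def probabilities(X, b):
    """Return the p-value for every row of the design matrix X."""
    return [1.0 / (1.0 + math.exp(-sum(bj * xj for bj, xj in zip(b, row))))
            for row in X]

def newton_raphson_step(X, Y, b):
    """Return b + inv(X^T W X) X^T (Y - p) for the current b-values."""
    p = probabilities(X, b)
    w = [pi * (1.0 - pi) for pi in p]           # diagonal of W
    k = len(b)
    # X^T W X, a k-by-k matrix
    xtwx = [[sum(w[i] * X[i][r] * X[i][c] for i in range(len(X)))
             for c in range(k)] for r in range(k)]
    # X^T (Y - p), a length-k vector
    grad = [sum(X[i][r] * (Y[i] - p[i]) for i in range(len(X)))
            for r in range(k)]
    delta = solve(xtwx, grad)                   # equals inv(X^T W X) times grad
    return [bj + dj for bj, dj in zip(b, delta)]

X = [[1.0, 48, 1, 4.40],
     [1.0, 60, 0, 7.89],
     [1.0, 51, 0, 3.48],
     [1.0, 66, 0, 8.41],
     [1.0, 40, 1, 3.05]]
Y = [0, 1, 0, 1, 0]
b = [1.00, 0.01, 0.01, 0.01]

print([round(p, 4) for p in probabilities(X, b)])   # initial p-values
b = newton_raphson_step(X, Y, b)
print([round(v, 2) for v in b])                     # b-values after one iteration
print([round(p, 2) for p in probabilities(X, b)])   # new p-values
```

Run as written, this should reproduce, up to rounding, the p-values and b-values quoted above.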

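Perhaps the main weakness of Newton-Raphson is that it requires matrix inversion, and inversion can easily fail: the matrix being inverted becomes singular, or nearly so, when input columns are linearly dependent or when the p-values get so close to 0 or 1 that the (p)(1-p) values in W shrink toward zero.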
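Here is a tiny, made-up illustration of one way inversion fails. The data is hypothetical, not from my example: the single predictor is 1 for every item, so it duplicates the leading column of 1.0s, and the X^T W X matrix that Newton-Raphson needs to invert has a determinant of zero.

```python
# Hypothetical data: the single predictor is 1 for every item, so it
# duplicates the leading column of 1.0s in the design matrix.
X = [[1.0, 1.0],
     [1.0, 1.0],
     [1.0, 1.0]]
p = [0.5, 0.5, 0.5]                    # probabilities when all b-values are 0
w = [pi * (1.0 - pi) for pi in p]      # diagonal of W

# X^T W X for the 2-column design matrix
xtwx = [[sum(w[i] * X[i][r] * X[i][c] for i in range(len(X)))
         for c in range(2)] for r in range(2)]

determinant = xtwx[0][0] * xtwx[1][1] - xtwx[0][1] * xtwx[1][0]
print(determinant)   # a zero determinant means X^T W X has no inverse
```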