The hardest part about this blog post is explaining the problem. The answer is easy: when training a logistic regression model using stochastic gradient ascent/descent, the update rule should be derived from cross entropy loss rather than squared error loss.
A full explanation of the problem would require pages and pages and pages, so I’m going to have to leave a huge amount of important detail out.
Suppose you are performing binary logistic regression (such as predicting whether a person is Male or Female based on age, income, height, and years of education). There are several ways to train the LR model. In my world at least, the most common training algorithm is stochastic gradient ascent to maximize the log-likelihood, which is equivalent to minimizing cross entropy error. The update rule is:
w = w + lr * x * (t - o)
In words, the new weight is the old weight plus a small learning rate (lr) times the associated input value (x) times the difference between the target (t, which is 0 or 1) and the computed output probability (o).
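To make the update rule concrete, here is a minimal sketch in Python. The function name train_log_likelihood, the toy data set, and the settings (lr = 0.1, 200 epochs, a fixed shuffle seed) are my own assumptions for illustration, not part of any library. The bias b is updated the same way as a weight whose input is always 1.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_log_likelihood(data, lr=0.1, epochs=200, seed=0):
    """Stochastic gradient ascent on the log-likelihood.

    data is a list of (x_vector, t) pairs, with t equal to 0 or 1.
    """
    rng = random.Random(seed)
    n = len(data[0][0])
    w = [0.0] * n
    b = 0.0
    rows = list(data)
    for _ in range(epochs):
        rng.shuffle(rows)  # visit the training items in random order
        for x, t in rows:
            o = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            for i in range(n):
                w[i] += lr * x[i] * (t - o)  # w = w + lr * x * (t - o)
            b += lr * (t - o)                # bias: input is implicitly 1
    return w, b

# Hypothetical one-feature data: small x means class 0, large x means class 1.
toy = [([0.0], 0), ([1.0], 0), ([2.0], 1), ([3.0], 1)]
w, b = train_log_likelihood(toy)
p_low = sigmoid(b + w[0] * 0.0)   # predicted probability for x = 0
p_high = sigmoid(b + w[0] * 3.0)  # predicted probability for x = 3
```

After training, p_low should land below 0.5 and p_high above 0.5, because the weight on x is pushed positive and the bias negative.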
But if you assume the base LR equation o = 1.0 / (1.0 + e^(-z)) where z = b + w0x0 + w1x1 + . . . and then minimize squared error = (t - o)^2 instead, the resulting update rule (with the constant factor of 2 from the derivative absorbed into the learning rate) becomes:
w = w + lr * x * (t - o) * (o * (1 - o))
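One way to convince yourself that the extra o * (1 - o) factor really belongs to the squared error version is a finite-difference check. The sketch below uses a single weight and arbitrary values I made up (w = 0.3, b = -0.2, x = 1.5, t = 1). I use half the squared error so the factor of 2 disappears, which is the same as absorbing it into the learning rate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(w, b, x):
    return sigmoid(b + w * x)

def log_likelihood(w, b, x, t):
    o = output(w, b, x)
    return t * math.log(o) + (1 - t) * math.log(1 - o)

def half_squared_error(w, b, x, t):
    o = output(w, b, x)
    return 0.5 * (t - o) ** 2

def numeric_slope(f, w, b, x, t, h=1e-6):
    # Central difference approximation of df/dw.
    return (f(w + h, b, x, t) - f(w - h, b, x, t)) / (2 * h)

# Arbitrary illustrative values, not taken from any data set.
w, b, x, t = 0.3, -0.2, 1.5, 1
o = output(w, b, x)

ll_step = x * (t - o)                # direction used by the first rule
se_step = x * (t - o) * o * (1 - o)  # direction used by the second rule

ll_numeric = numeric_slope(log_likelihood, w, b, x, t)       # ascent
se_numeric = -numeric_slope(half_squared_error, w, b, x, t)  # descent
```

Both analytic directions should agree with their numeric counterparts to several decimal places, and the only difference between them is the o * (1 - o) factor.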
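There is an additional o * (1 - o) term, which comes from the derivative of the logistic sigmoid function. The first update rule is simpler, and the log-likelihood function it maximizes is concave, so it has a single global maximum and is easier to optimize. The second update rule is more complex, and according to the one paper I was able to find on this topic, the squared error function is not convex in the weights and so is more difficult to optimize.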
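The convexity difference is easy to see numerically even in the smallest possible case. The sketch below is my own illustration, not taken from the paper: one training item with x = 1 and t = 1, no bias, and a single weight w. It estimates the second derivative of each error function over a grid of w values; a convex function never has negative curvature.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error(w, x=1.0, t=1.0):
    return (t - sigmoid(w * x)) ** 2

def neg_log_likelihood(w, x=1.0, t=1.0):
    o = sigmoid(w * x)
    return -(t * math.log(o) + (1.0 - t) * math.log(1.0 - o))

def curvature(f, w, h=1e-3):
    # Second-order central difference approximation of f''(w).
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / (h * h)

grid = [i / 2.0 for i in range(-12, 13)]  # w from -6.0 to 6.0 in steps of 0.5
se_curv = [curvature(squared_error, w) for w in grid]
nll_curv = [curvature(neg_log_likelihood, w) for w in grid]
```

For this one-item case the squared error curvature goes negative once w drops below roughly -0.69 (where the computed output falls under 1/3), while the negative log-likelihood curvature stays positive across the whole grid.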
Link to the best-known (and excellent) derivation of the usual LR update rule:
Link to an obscure paper that states the squared error approach is not convex:
By the way, when monitoring error during training, or when evaluating the error of a trained LR model, using squared error is considered acceptable. The mean squared error between the predicted probabilities and the 0/1 targets is known as the Brier score.
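Computing this kind of squared error for evaluation takes only a couple of lines. The helper name brier_score and the four predictions below are made up for illustration.

```python
def brier_score(probs, targets):
    """Mean squared difference between predicted probabilities and 0/1 targets."""
    return sum((p - t) ** 2 for p, t in zip(probs, targets)) / len(targets)

# Hypothetical model outputs and the true 0/1 labels.
probs = [0.9, 0.2, 0.6, 0.4]
targets = [1, 0, 1, 0]
score = brier_score(probs, targets)  # (0.01 + 0.04 + 0.16 + 0.16) / 4
```

Lower is better: a model that always predicts the correct class with probability 1.0 scores 0.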