Cross Entropy Error and Logistic Regression

I was thinking about cross entropy error and its relationship to logistic regression. A full explanation would take dozens of pages so I’ll be relatively brief at the cost of leaving out a lot of important details.

Logistic regression is a machine learning technique used to predict something that can be just one of two possible values. For example, you might want to predict the sex (y) (where male = 0, female = 1) of a person based on their age (x0) and income (x1).

The logistic regression model is:

y = 1 / (1 + e^-z)

z = a + (b0)(x0) + (b1)(x1) + (b2)(x2) + . . .

The a and bi values are numeric constants called the weights. The value of y will always be between 0.0 and 1.0. If the calculated y is closer to 0 (less than or equal to 0.5) then the prediction is 0. If the calculated y is closer to 1 (greater than 0.5) then the prediction is 1.

For example, if a = 1.0, b0 = 0.1, b1 = -0.2, b2 = 0.3, and x0 = 1, x1 = 2, x2 = 3 then:

z = 1.0 + (0.1)(1) + (-0.2)(2) + (0.3)(3)
  = 1.60
y = 1 / (1 + e^-1.6)
  = 0.8320
prediction = 1
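
The arithmetic is easy to check with a few lines of Python. This is just a minimal sketch; the function name logistic_output is my own and isn't from any library.

```python
import math

def logistic_output(a, b, x):
    """Return (z, y) for constant a, weights b and inputs x."""
    # z is the constant plus the weighted sum of the inputs
    z = a + sum(bi * xi for bi, xi in zip(b, x))
    # y squashes z into the range (0, 1)
    y = 1.0 / (1.0 + math.exp(-z))
    return z, y

# Weights and inputs from the example above
z, y = logistic_output(1.0, [0.1, -0.2, 0.3], [1, 2, 3])

# y <= 0.5 means predict 0, y > 0.5 means predict 1
prediction = 1 if y > 0.5 else 0

print(f"z = {z:.2f}, y = {y:.4f}, prediction = {prediction}")
```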

OK, the xi values are inputs but where do the a and bi values come from? You must get so-called training data that has known input and output values and then use a math algorithm to find the values of a and bi so that the calculated y values are as close as possible to the known correct y values (called the desired values) in the data.

The two most common training algorithms for logistic regression are “Newton-Raphson” and “gradient descent”.

As it turns out, the theory of logistic regression assumes that the goal is to find the a and bi values that maximize the probability of seeing the observed y values. This is called maximum likelihood estimation. But this is (almost) mathematically equivalent to minimizing what is called the cross entropy error.
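
The connection can be sketched in two lines. The notation here is mine: t_k is the desired y (0 or 1) and y_k is the calculated y for training item k, and the items are assumed to be independent. The likelihood of the whole data set and its negative log are:

```latex
L(a, b) = \prod_{k} y_k^{\,t_k} \, (1 - y_k)^{\,1 - t_k}

-\ln L(a, b) = -\sum_{k} \left[ \, t_k \ln y_k + (1 - t_k) \ln (1 - y_k) \, \right]
```

Because ln() is an increasing function, the weights that make L as large as possible are the same weights that make the sum on the right as small as possible, and that sum is the cross entropy error added up over the training items.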

There are actually several forms of cross entropy error (CEE). For logistic regression, cross entropy is best explained by an example. Suppose:

inputs       calc y  desired y
(-3, 6, -1)  0.6900   1
(1, 2, 3)    0.1680   0

The CEEs for the two data items are:

CEE = - [ ln(0.6900)(1) + ln(0.3100)(0) ]
    = 0.3711
CEE = - [ ln(0.1680)(0) + ln(0.8320)(1) ]
    = 0.1839

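In words, for each data item you take the ln() of the calculated y times the desired y, add the ln() of (1 - calculated y) times (1 - desired y), and then negate the sum. Because the desired y is either 0 or 1, only one of the two terms is ever non-zero.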
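
Here is the same calculation as a small Python sketch. The function name cross_entropy_error is my own, and the code assumes the calculated y is strictly between 0 and 1 so that ln() is defined.

```python
import math

def cross_entropy_error(calc_y, desired_y):
    """CEE for one data item; desired_y must be 0 or 1."""
    return -(math.log(calc_y) * desired_y
             + math.log(1.0 - calc_y) * (1 - desired_y))

# (calculated y, desired y) for the two data items above
items = [(0.6900, 1), (0.1680, 0)]

for calc_y, desired_y in items:
    print(f"{cross_entropy_error(calc_y, desired_y):.4f}")
# prints 0.3711 then 0.1839
```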

So, what’s the point? Well, when using Newton-Raphson or gradient descent for training a logistic regression model to find the values of a and the bi, the cross entropy error is used implicitly. The weight update formulas are derived from it, so you don’t see the CEE calculations themselves.
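
To make "implicitly" concrete, here is a minimal gradient descent sketch. The training data is made up for illustration (two inputs scaled to the range 0 to 1), and the learning rate and epoch count are arbitrary choices of mine. Notice that the training loop never calls ln(). The update uses only (desired y - calculated y), which is what you get when you differentiate CEE with respect to z. The mean_cee function is there only to report the error afterwards.

```python
import math

# Made-up training data: ((x0, x1), desired y)
DATA = [((0.1, 0.2), 0), ((0.3, 0.1), 0), ((0.2, 0.4), 0),
        ((0.7, 0.9), 1), ((0.8, 0.6), 1), ((0.9, 0.8), 1)]

def predict(weights, x):
    # weights[0] is the constant a; weights[1:] are the b values
    z = weights[0] + sum(b * xi for b, xi in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def mean_cee(weights, data):
    # Reporting only; training below does not use this function
    total = 0.0
    for x, t in data:
        y = min(max(predict(weights, x), 1e-12), 1.0 - 1e-12)
        total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total / len(data)

def train(data, learn_rate=0.5, epochs=500):
    weights = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        for x, t in data:
            # (t - y) is the negative gradient of CEE with respect to z
            delta = t - predict(weights, x)
            weights[0] += learn_rate * delta
            for i, xi in enumerate(x):
                weights[i + 1] += learn_rate * delta * xi
    return weights

trained = train(DATA)
print("weights:", [round(w, 4) for w in trained])
print("mean CEE before:", round(mean_cee([0.0, 0.0, 0.0], DATA), 4))
print("mean CEE after: ", round(mean_cee(trained, DATA), 4))
```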

BUT, suppose you wanted to use some other form of training, for example swarm optimization. In such a situation you still want to minimize cross entropy error and so you’d have to calculate CEE explicitly.
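
Here is a rough sketch of what that could look like, using a bare-bones particle swarm. Everything specific in it is an assumption of mine: the made-up training data, the swarm size, the inertia and attraction constants (0.7 and 1.4), and the function names. The point is only that the swarm needs a number to minimize, so mean_cee has to be computed explicitly for every candidate set of weights.

```python
import math
import random

# Made-up training data: ((x0, x1), desired y)
DATA = [((0.1, 0.2), 0), ((0.3, 0.1), 0), ((0.2, 0.4), 0),
        ((0.7, 0.9), 1), ((0.8, 0.6), 1), ((0.9, 0.8), 1)]

def sigmoid(z):
    # Written so exp() is never called on a large positive number
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def mean_cee(w, data):
    # w[0] is the constant a; w[1:] are the b values
    total = 0.0
    for x, t in data:
        y = sigmoid(w[0] + sum(b * xi for b, xi in zip(w[1:], x)))
        y = min(max(y, 1e-12), 1.0 - 1e-12)  # keep ln() away from 0
        total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total / len(data)

def swarm_train(data, n_particles=20, n_iters=100, seed=0):
    rng = random.Random(seed)
    dim = len(data[0][0]) + 1
    pos = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]                 # each particle's best so far
    best_err = [mean_cee(p, data) for p in pos]
    g = min(range(n_particles), key=lambda i: best_err[i])
    g_pos, g_err = best_pos[g][:], best_err[g]     # swarm's best so far
    history = [g_err]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                # Inertia plus pulls toward the particle's and the swarm's best
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.4 * rng.random() * (best_pos[i][d] - pos[i][d])
                             + 1.4 * rng.random() * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            err = mean_cee(pos[i], data)           # explicit CEE calculation
            if err < best_err[i]:
                best_pos[i], best_err[i] = pos[i][:], err
                if err < g_err:
                    g_pos, g_err = pos[i][:], err
        history.append(g_err)
    return g_pos, g_err, history

weights, error, history = swarm_train(DATA)
print("weights:", [round(w, 4) for w in weights])
print(f"mean CEE = {error:.4f}")
```

Swapping in a different error measure just means passing the swarm a different function in place of mean_cee.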

Whew! That’s a lot of information. Now what got me thinking about this all in the first place is that I was wondering about using explicit mean squared error (MSE) instead of cross entropy error to train a logistic regression model. Theory says that using MSE isn’t as good as using CEE.

The example in the image shows a little experiment I did. The first model is OK and has MSE = 0.0709 and CEE = 0.2921. The second model fits the data better and has MSE = 0.0219 and CEE = 0.1245. Both MSE and CEE are lower for the better model. I strongly suspect that training a logistic regression model using explicit MSE with swarm optimization would give you a model that’s as good in practice as a model trained using implicit or explicit CEE.
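
For reference, here is how the two error measures can be computed side by side. The data from the image isn't reproduced here, so this sketch uses the two data items from the cross entropy example above, and both errors are averaged over the items. The function names are mine.

```python
import math

def mean_squared_error(items):
    # Average of (desired y - calculated y)^2
    return sum((t - y) ** 2 for y, t in items) / len(items)

def mean_cross_entropy_error(items):
    # Average of the per-item CEE; assumes 0 < calculated y < 1
    return sum(-(t * math.log(y) + (1 - t) * math.log(1.0 - y))
               for y, t in items) / len(items)

# (calculated y, desired y) pairs
items = [(0.6900, 1), (0.1680, 0)]

print(f"MSE = {mean_squared_error(items):.4f}")
print(f"CEE = {mean_cross_entropy_error(items):.4f}")
# prints MSE = 0.0622 then CEE = 0.2775
```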

