The Learning Update Rule for Kernel Logistic Regression

Regular logistic regression (LR) predicts a binary value (0 or 1) from two or more numeric predictor variables. For example, you might use LR to predict whether a person is Male (0) or Female (1) based on years of education (x1) and annual income (x2).

The LR prediction equation is:

p = 1 / (1 + exp(-z)), where z = b0 + (b1 * x1) + (b2 * x2)
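To make that concrete, here is a minimal Python sketch of the prediction (the function name and the b-weight values are my own, made up just for illustration):

import math

def predict_lr(x1, x2, b0, b1, b2):
    # weighted sum of the inputs plus the b0 bias
    z = b0 + (b1 * x1) + (b2 * x2)
    # the logistic sigmoid squashes z into a p-value between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

# example: p >= 0.5 predicts class 1 (Female), otherwise class 0 (Male)
p = predict_lr(x1=16.0, x2=53.0, b0=-12.5, b1=0.8, b2=-0.01)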
Using some clever math, you can find the values of the b-weights using stochastic gradient descent. The learning rule update equation for regular LR is:

b_j = b_j + eta * (t - y) * x_j   (with x0 = 1 for the b0 bias)
In words: read a training data item, then update each b-weight by the difference between the target value t (0 or 1) and the computed y value (using the current b-weight values), times eta (a small learning rate like 0.01), times the associated x value. If you do this enough times, you'll find good values for the b-weights.
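Here's a minimal Python sketch of that training loop, under the assumption of two predictor variables as in the example above (the function name and hyperparameter defaults are my own):

import math

def train_lr(data, eta=0.01, max_epochs=100):
    # data is a list of (x1, x2, t) tuples, where t is the 0/1 target
    b0, b1, b2 = 0.0, 0.0, 0.0
    for _ in range(max_epochs):
        for (x1, x2, t) in data:
            z = b0 + (b1 * x1) + (b2 * x2)
            y = 1.0 / (1.0 + math.exp(-z))   # computed y, current weights
            # each weight moves by eta * (target - computed) * associated input
            b0 += eta * (t - y) * 1.0        # the bias uses a dummy input of 1
            b1 += eta * (t - y) * x1
            b2 += eta * (t - y) * x2
    return b0, b1, b2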

Unfortunately, regular LR works only for simple, linearly separable data. But you can use what's called "the kernel trick" to create kernel logistic regression (KLR), which can handle data that is not linearly separable.

The KLR prediction equation is:

p = 1 / (1 + exp(-z)), where z = (a1 * K(x, x1)) + (a2 * K(x, x2)) + . . . + (an * K(x, xn))
Each training data item has an associated weight, a. Instead of using the plain input x-values directly, you use a sum of kernel function values, where the kernel function K calculates the similarity between the input x-values and each of the training items. (Note: I've really abused the math notation here to try to keep things simple.) The goal now is to find the a-weights.
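I haven't pinned down a particular kernel function here, so this Python sketch assumes the common Gaussian (RBF) kernel; the sigma parameter and the function names are my own:

import math

def rbf_kernel(u, v, sigma=1.0):
    # Gaussian (RBF) similarity: 1.0 when u == v, approaching 0 as u and v differ
    dist_sq = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-dist_sq / (2.0 * sigma * sigma))

def predict_klr(x, train_x, a_wts, sigma=1.0):
    # z sums the similarity of x to every training item, each weighted
    # by that training item's a-weight
    z = sum(a * rbf_kernel(x, x_i, sigma) for a, x_i in zip(a_wts, train_x))
    return 1.0 / (1.0 + math.exp(-z))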

Now here's my point. In KLR, the kernel function values act like the plain x inputs of regular LR, and the a-weights act like the b-weights. So, by hand-waving analogy, the learning rule update equation for KLR is:

a_i = a_i + eta * (t - y) * K(x, x_i)
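A self-contained Python sketch of that training loop, again assuming the Gaussian kernel (the eta and max_epochs defaults are made-up illustration values, not tuned):

import math

def train_klr(train_x, train_t, eta=0.001, max_epochs=1000, sigma=1.0):
    def kernel(u, v):
        # Gaussian (RBF) similarity between two items
        d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
        return math.exp(-d2 / (2.0 * sigma * sigma))

    a_wts = [0.0] * len(train_x)  # one a-weight per training item
    for _ in range(max_epochs):
        for x_j, t_j in zip(train_x, train_t):
            # computed y for the item just read, using the current a-weights
            z = sum(a * kernel(x_j, x_i) for a, x_i in zip(a_wts, train_x))
            y_j = 1.0 / (1.0 + math.exp(-z))
            # each a_i moves by eta * (target - computed) * K(item read, item i),
            # exactly where regular LR used the raw x input
            for i, x_i in enumerate(train_x):
                a_wts[i] += eta * (t_j - y_j) * kernel(x_j, x_i)
    return a_wts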
This is all quite complicated, and the only way I could really understand what was going on was to code up a demo program.

As it turns out, KLR has some basically fatal flaws. Because the kernel function must compare each item with every training item, KLR isn't practical for really large data sets. There are several other drawbacks too, which is why KLR just isn't used very often.
