Like many software engineers, I obsess over small things and it drives me crazy when there are details about some topic that I don’t fully understand. Kernel logistic regression (KLR) has had my attention for days now, but I think I can finally put the topic out of my mind because I completely grasp the calculations.
Suppose you want to predict the binary class (0 or 1) for input = (3.0, 5.0). You have four training items and you run KLR code to create a prediction model.
train data        a       K        a*K
(2.0, 4.0, 0)   -0.3    0.3678   -0.1103
(4.0, 1.0, 1)    0.4    0.0002    0.0001
(5.0, 3.0, 0)   -0.2    0.0183   -0.0036
(6.0, 7.0, 1)    0.6    0.0015    0.0009
bias                               0.001
                                 -------
                          z   =  -0.1120
                          p   =   0.4720
                          C   =   0
The KLR training process generates an alpha (a) value associated with each training item (the alphas are -0.3, 0.4, -0.2, 0.6), plus a separate bias value (0.001 above). There is also a kernel value, K, for each training item that measures the similarity between that training item and the item you're trying to make the prediction for, which is (3.0, 5.0) here.
To make the prediction you multiply each a by its corresponding K, sum the products, and add the bias, giving z = -0.1120. Then you calculate p = 1 / (1 + exp(-z)) = 1 / (1 + exp(0.1120)) = 0.4720.
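The whole prediction step fits in a few lines of Python. This is just a sketch with names of my own choosing; the bias is taken as 0.001, which reproduces the z = -0.1120 and p = 0.4720 values shown above:

```python
import math

def kernel_rbf(x1, x2, sigma=1.0):
    # RBF kernel: K = exp(-||x1 - x2||^2 / (2 * sigma^2))
    d2 = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-d2 / (2.0 * sigma * sigma))

# training items (predictor values only) and the trained model values
train_x = [(2.0, 4.0), (4.0, 1.0), (5.0, 3.0), (6.0, 7.0)]
alphas = [-0.3, 0.4, -0.2, 0.6]
bias = 0.001

def predict(x):
    # z = sum of each alpha times its kernel value, plus the bias
    z = sum(a * kernel_rbf(tx, x) for a, tx in zip(alphas, train_x)) + bias
    p = 1.0 / (1.0 + math.exp(-z))
    return p, (0 if p < 0.5 else 1)

p, c = predict((3.0, 5.0))
print(round(p, 4), c)  # prints: 0.472 0
```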
The p value will be between 0.0 and 1.0. If p < 0.5 the prediction is C = 0; if p >= 0.5 the prediction is C = 1. Here p = 0.4720, so the predicted class is C = 0.
There are several kernel functions. This example uses the radial basis function (RBF) kernel with sigma = 1.0 (the RBF kernel is a topic all by itself). If K = 1.0 the two items are identical. The closer K is to 0.0, the greater the difference between the two items.
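Here is a minimal sketch of the RBF kernel, applied to each training item paired with the item to predict. The function name is my own; the results match the K column in the table above:

```python
import math

def kernel_rbf(x1, x2, sigma=1.0):
    # K = exp(-squared Euclidean distance / (2 * sigma^2))
    d2 = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-d2 / (2.0 * sigma * sigma))

item = (3.0, 5.0)
for tx in [(2.0, 4.0), (4.0, 1.0), (5.0, 3.0), (6.0, 7.0)]:
    print(tx, round(kernel_rbf(tx, item), 4))
```

Notice that an item compared with itself gives K = exp(0) = 1.0, the maximum possible similarity, which matches the statement above.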
Behind the scenes, during KLR training, the kernel function must be calculated for all pairs of training items, so KLR may not be feasible when the number of training items is large. This weakness, in part, led to the development of support vector machines (SVMs), which have some similarities to KLR.