Even among my engineering colleagues who work with machine learning quite often, the basic ideas behind support vector machine (SVM) classification are a bit hazy. I believe this mild confusion is due, in part, to the fact that SVM classification has a large number of ideas. Each idea is simple by itself, but the sheer number of ideas leads to some confusion.
If you’ve ever tried to read an article that explains SVMs, you’re likely overwhelmed because to truly present SVMs, an author has to explain pages and pages of preliminaries.
This blog post is my effort to explain SVM classification in as few sentences as possible.
One possible form of the prediction equations for SVM classification is:
Here xi is the (vector of values) to predict. The xj are so-called support vectors which are a subset of the training data. The yj is the class (-1 or +1) of each data xj. The aj are constants, one for each xj. The b is a single numeric constant. Letter s is the number of support vectors. The K is a kernel function that accepts two vectors and returns a single number that is a measure of similarity between the two vectors, where 1.0 means identical and 0.0 means as different as possible.
Suppose you have 100 training items. Using part of the SVM algorithm, you determine that only s = 2 of the training items are the special support vectors:
(1) 5.0 3.0 2.0 +1 (2) 4.0 6.0 1.0 -1
And using the SVM algorithm, you somehow determine that the two aj values are (0.8, 0.7). And you determine that b = 0.75.
Suppose the kernel function used (there are dozens of such functions) for all this is the radial basis function (RBF), which is the second equation above.
The notation in the numerator means squared Euclidean distance. The gamma in the denominator is a constant — suppose it’s 1.0.
So if you want to predict the class (+1 or -1) of a new item that has predictor values (8.0, 5.0, 4.0), the calculations are:
j = 1: (0.8)(+1) * K([5,3,2], [8,5,4]) + 0.75 = (0.8) * 8.5 + 0.75 = 7.55 j = 2: (0.7)(-1) * K([4,6,1], [8,5,4]) + 0.75 = (-0.7) * 13.0 + 0.75 = -8.35
Then 7.55 + -8.35 = -0.80 which is negative, so the predicted class is -1.
So, SVM classification involves first choosing a kernel function and any parameters that function has (such as gamma for RBF), then using a complex algorithm, usually SMO (“sequential minimal optimization”) to determine which training items are the special support vectors, and the aj values for each support vector, and the b constant.