Determining Weights for Weighted k-NN Classification

The k-nearest neighbors classification algorithm is one of the oldest and simplest machine learning techniques. I was exploring the technique recently and was mildly surprised to find very little practical information about how to generate k-NN voting weights.

Suppose k is set to 5. The first part of k-NN calculates the distances from the item-to-be-classified to all the labeled data, then finds the k = 5 nearest/closest labeled data vectors.
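
A minimal sketch of this first step, assuming Euclidean distance and NumPy arrays (the function and variable names here are just illustrative, not part of any standard API):

import numpy as np

def k_nearest(item, data, labels, k=5):
    # Euclidean distance from the item-to-be-classified to every labeled vector
    dists = np.sqrt(np.sum((data - item) ** 2, axis=1))
    # indices of the k smallest distances, closest first
    idx = np.argsort(dists)[:k]
    return dists[idx], labels[idx]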

For example, suppose the classification problem has 3 classes, and the 5 nearest data vectors are:

data                      dist  class
=====================================
(0.45, 0.32, 0.78, 0.11)  0.1234  1
(0.65, 0.33, 0.92, 0.54)  0.2076  0
(0.52, 0.47, 0.82, 0.55)  0.3588  0
(0.41, 0.78, 0.43, 0.58)  0.5032  1
(0.73, 0.29, 0.44, 0.61)  0.7505  2

To which class, 0, 1, or 2, should the item-to-be-classified be assigned? Class 1 is the closest neighbor, but class 0 accounts for the second and third closest.

The simplest approach is to use uniform voting weights. For k = 5, each weight is 1 / 5 = 0.20, so the votes are:

Class 0: 2 * 0.20 = 0.40
Class 1: 2 * 0.20 = 0.40
Class 2: 1 * 0.20 = 0.20

This results in a tie between class 0 and class 1. The uniform voting weights approach is equivalent to a simple majority vote.
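
In code, uniform voting is just a count of each class among the k neighbors, scaled by 1/k. A small Python sketch (the name vote_uniform is mine, chosen for illustration):

import numpy as np

def vote_uniform(nearest_labels, num_classes):
    k = len(nearest_labels)
    votes = np.zeros(num_classes)
    for lab in nearest_labels:
        votes[lab] += 1.0 / k   # each neighbor contributes 1/k to its class
    return votes

# neighbors from the example above: classes 1, 0, 0, 1, 2
print(vote_uniform([1, 0, 0, 1, 2], 3))   # [0.4 0.4 0.2] -- tie between 0 and 1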

There are several ways to create weights that give more importance to closer data vectors. The inverse weights approach computes the inverse of each distance, sums the inverses, then divides each inverse by the sum of the inverses. For the data above the weights are:

dist     inverse  wts = inv/sum
===============================
0.1234   8.1037    0.4259  
0.2076   4.8170    0.2532
0.3588   2.7871    0.1465
0.5032   1.9873    0.1044
0.7505   1.3324    0.0700

sum     19.0275    1.0000

The voting is:

Class 0: 0.2532 + 0.1465 = 0.3997
Class 1: 0.4259 + 0.1044 = 0.5303
Class 2: 0.0700          = 0.0700

And so the conclusion is that the item-to-be-classified is class 1.
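
Here is a small Python sketch of the inverse weights voting, using the five distances and class labels from the example above (vote_inverse is an illustrative name, not a library function):

import numpy as np

def vote_inverse(dists, labels, num_classes):
    inv = 1.0 / np.array(dists)    # inverse of each distance
    wts = inv / np.sum(inv)        # normalize so the weights sum to 1
    votes = np.zeros(num_classes)
    for w, lab in zip(wts, labels):
        votes[lab] += w
    return votes

dists  = [0.1234, 0.2076, 0.3588, 0.5032, 0.7505]
labels = [1, 0, 0, 1, 2]
print(vote_inverse(dists, labels, 3))   # approx [0.3997 0.5303 0.0700] -> class 1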

I usually use the inverse weights approach, but there are other weighting techniques for k-NN. One alternative is to use the ranks of the distances and compute rank order centroid weights; this weights closer labeled data more heavily than the inverse technique does. Another approach is to sum the distances, divide each distance by the sum, then reverse the order of the resulting weights; this penalizes large distances more heavily than the inverse technique does.
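
Here are sketches of those two alternatives, assuming the neighbors are already sorted from closest to farthest. The implementations reflect my reading of the descriptions above rather than any standard library routine:

import numpy as np

def roc_weights(k):
    # rank order centroid: weight for rank i (1-based) is (1/k) * sum of 1/j for j = i..k
    return np.array([sum(1.0 / j for j in range(i, k + 1)) / k
                     for i in range(1, k + 1)])

def reversed_proportion_weights(dists):
    # each distance divided by the sum of distances, then the weight order reversed
    props = np.array(dists) / np.sum(dists)
    return props[::-1]

print(roc_weights(5))   # approx [0.4567 0.2567 0.1567 0.0900 0.0400]
print(reversed_proportion_weights([0.1234, 0.2076, 0.3588, 0.5032, 0.7505]))
# in both schemes the closest neighbor gets the largest weight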



I find high fashion quite interesting. I couldn’t even begin to assign quality weights to the dresses at a fashion show.
