I was thinking about coding up a demo of the k-nn ("k nearest neighbors") classifier algorithm. The k-nn algorithm is almost too simple. The goal is to predict the class of something, for example, the political leaning (conservative, moderate, liberal) of a person based on age, income, and number of dependents. You get a bunch of data that has known class labels, then set k (suppose k = 3), then find the k closest data items to your unknown item, then "take a vote" among those neighbors to determine the predicted class.

Well, I was thinking about the part of the algorithm where you find the k nearest neighbors. The simplest approach is to calculate the distances between the item-to-predict and all known data items, then sort from nearest to farthest, then use the k nearest.
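That simple approach can be sketched in a few lines of Python (a sketch of the idea only, not the actual demo; the `k_nearest` name and the choice of Euclidean distance are my assumptions):

```python
import math

def k_nearest(train, unknown, k):
    # Compute the distance from the unknown item to every known item,
    # pair each distance with its item index, sort nearest-first, keep k.
    dists = [(math.dist(item, unknown), i) for i, item in enumerate(train)]
    dists.sort()
    return dists[:k]

# Toy data: five one-feature items at the distances used in the example below.
train = [[4.0], [0.0], [2.0], [3.0], [1.0]]
nearest = k_nearest(train, [0.0], 3)
# nearest-first indices: 1, 4, 2
```

In a real classifier you'd then tally the class labels of those k indices and return the majority vote.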

Anyway, there are a lot of implementation details. One approach would be to create an object that holds a distance value plus an index that IDs each item. For languages like C# this is a good approach because you can sort objects automatically (well, almost) by implementing the IComparable interface.
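The object approach looks roughly like this in Python, where defining `__lt__` plays the role that IComparable plays in C# (the `NeighborInfo` class name is just illustrative):

```python
class NeighborInfo:
    # Holds a distance plus the index (ID) of the corresponding data item.
    def __init__(self, dist, idx):
        self.dist = dist
        self.idx = idx

    def __lt__(self, other):
        # Comparing by distance makes the objects sortable,
        # analogous to implementing IComparable in C#.
        return self.dist < other.dist

infos = [NeighborInfo(d, i) for i, d in enumerate([4.0, 0.0, 2.0, 3.0, 1.0])]
infos.sort()  # sorts by distance; each index travels with its distance
# indices after sorting: 1, 4, 2, 3, 0
```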

An older approach is to set up two parallel arrays where one of the arrays holds distance values and the other array holds index IDs. Then you can sort on the distance array but every time two cells are exchanged, you also exchange the corresponding IDs in the other array.

For example, suppose you have just five data items and item [0] is 4.0 units away from the item to classify, and items [1] through [4] have distances of 0.0, 2.0, 3.0, 1.0 respectively. The distance array would be { 4.0, 0.0, 2.0, 3.0, 1.0 } and the IDs array would be { 0, 1, 2, 3, 4 }. After sorting you'd have:

distances: 0.0  1.0  2.0  3.0  4.0
IDs:       1    4    2    3    0

Item [1] is closest, item [4] is second closest, and so on. Just for fun, I coded up a little demo that implements the quicksort algorithm, but on two arrays. Good fun!
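A parallel-array quicksort along those lines might look like this in Python (a sketch under my own assumptions, not the actual demo code; the function name and the Hoare-style partition are my choices):

```python
def quicksort_parallel(dists, ids, lo=0, hi=None):
    # In-place quicksort on dists; every swap is mirrored in ids, so
    # ids[j] always identifies the item whose distance is dists[j].
    if hi is None:
        hi = len(dists) - 1
    if lo >= hi:
        return
    pivot = dists[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:
        while dists[i] < pivot:
            i += 1
        while dists[j] > pivot:
            j -= 1
        if i <= j:
            dists[i], dists[j] = dists[j], dists[i]
            ids[i], ids[j] = ids[j], ids[i]  # mirror the swap in the IDs array
            i += 1
            j -= 1
    quicksort_parallel(dists, ids, lo, j)
    quicksort_parallel(dists, ids, i, hi)

dists = [4.0, 0.0, 2.0, 3.0, 1.0]
ids = [0, 1, 2, 3, 4]
quicksort_parallel(dists, ids)
# dists -> [0.0, 1.0, 2.0, 3.0, 4.0], ids -> [1, 4, 2, 3, 0]
```

Running it on the five-item example above reproduces the sorted arrays shown earlier.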
