K-Means++ Data Clustering

I wrote an article titled “K-Means Data Clustering” in the August 2015 issue of Microsoft’s MSDN Magazine. See https://msdn.microsoft.com/en-us/magazine/mt185575.

Data clustering is the process of grouping data items so that similar items are placed together. Once grouped, the clusters can be examined to see if there are relationships that might be useful. For example, if a large set of voting data were clustered, the data in each cluster might reveal patterns that could be used for targeted advertising.

There are several clustering algorithms. One of the most common is the k-means algorithm, which itself has several variations. My article explains a relatively recent (2007) variation called the k-means++ algorithm.
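For readers who haven't seen it, here is a minimal sketch of the standard k-means loop. This is my illustration, not the code from the article; it assumes the data items are numeric tuples and uses Euclidean distance:

```python
import math
import random

def dist(a, b):
    # Euclidean distance between two equal-length numeric tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(data, k, max_iter=100):
    # Naive initialization: pick k distinct data items as the initial means.
    # (This random seeding is exactly what k-means++ improves upon.)
    means = random.sample(data, k)
    for _ in range(max_iter):
        # Assignment step: each item goes to the cluster of its nearest mean.
        clusters = [[] for _ in range(k)]
        for item in data:
            idx = min(range(k), key=lambda j: dist(item, means[j]))
            clusters[idx].append(item)
        # Update step: recompute each mean as the average of its cluster.
        # An empty cluster keeps its old mean.
        new_means = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c
                     else means[j] for j, c in enumerate(clusters)]
        if new_means == means:  # no change means the loop has converged
            break
        means = new_means
    return means
```

The k-means++ variant replaces the random.sample seeding above with the proportional selection described next.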

The k-means++ algorithm uses a technique called proportional fitness selection to initialize the clustering process. There are several ways to implement proportional fitness selection; I used a technique called roulette wheel selection.
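The post doesn't spell out roulette wheel selection, so the sketch below is my illustration of the idea, not the article's code, and the helper names such as pick_next_mean are mine. In k-means++, each data item's chance of being chosen as the next initial mean is proportional to its squared distance from the nearest mean already chosen:

```python
import math
import random

def dist(a, b):
    # Euclidean distance (same helper as in the k-means sketch above).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def roulette_wheel_select(weights):
    # Pick an index with probability proportional to its weight.
    total = sum(weights)
    p = random.random() * total  # a spin of the wheel, in [0, total)
    cum = 0.0
    for i, w in enumerate(weights):
        cum += w
        if p < cum:
            return i
    return len(weights) - 1  # guard against floating-point round-off

def pick_next_mean(data, means):
    # k-means++ weighting: squared distance to the nearest existing mean.
    weights = [min(dist(item, m) for m in means) ** 2 for item in data]
    return data[roulette_wheel_select(weights)]
```

Precomputing a cumulative-probability array, as the reader comment below discusses, is an equivalent way to implement the same spin.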

To summarize: there are several algorithms to group data, and the most common is k-means. Standard k-means is very sensitive to its initialization. The k-means++ variant addresses this with a clever initialization scheme called proportional fitness selection, which can be implemented in several ways, including roulette wheel selection.


One Response to K-Means++ Data Clustering

  1. Hello,

    I really enjoyed your MSDN magazine article about k-means++. I was familiar with k-means and with fitness proportionate selection from genetic algorithms, so I found this article very interesting.

    I have a comment about the paragraph explaining roulette wheel selection. Maybe I gave it too quick of a read, but it seems like the example is off slightly. The indexes don’t seem to line up with the values in the example array (cum[i] = 0.4 while the first value in the array is 0.2). The algorithm will still work of course, but it may be harder for people to understand how. Also, p = 0.83, so cum[2] = 0.7 is not greater than p, so I think i = 3 would be returned. My apologies if I am missing something.

    Thanks for the article!
