Data clustering is the process of programmatically grouping data items so that similar items belong to the same cluster and dissimilar items belong to different clusters. There are many clustering algorithms, and by far the most common is k-means. Although k-means has been around for decades and is relatively simple, there are surprisingly few good implementations available on the Internet. A lot of k-means clustering code on the Web is incomplete or just plain wrong.
I wrote an article, “K-Means Data Clustering using C#,” which appears in the December 2013 issue of Visual Studio Magazine. See http://visualstudiomagazine.com/articles/2013/12/01/k-means-data-clustering-using-c.aspx. The k-means algorithm applies only when the data to be clustered is completely numeric. In the article I use an example of clustering 20 data items, where each item is a person’s height (in inches) and weight (in pounds). For example, the first data item is (65.0, 220.0).
Some of the details to pay attention to when implementing k-means are normalizing the data (so that a large-magnitude column such as weight does not dominate the distance calculation), preventing a cluster from ending up with no data items assigned to it, finding a good way to initialize the algorithm, and preventing an infinite processing loop. The VS Magazine article addresses all these issues.
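To make those four details concrete, here is a minimal sketch in Python. It is not the C# code from the article, just one way to handle each issue, and the specific choices are my assumptions for illustration: z-score normalization, a shuffled round-robin initial assignment, rejecting any update that would empty a cluster, and a `max_iter` cap. Except for the first item, the height and weight values are made up.

```python
import math
import random

def normalize(data):
    """Z-score normalize each column (assumes no column is constant)."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in data]

def distance(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(data, k, max_iter=100, seed=0):
    """Return a list giving the cluster ID (0..k-1) of each data item.

    Assumes len(data) >= k.
    """
    rng = random.Random(seed)
    n = len(data)

    # Initialization: shuffle the item indices and deal them out round-robin,
    # so every cluster starts with at least one item.
    order = list(range(n))
    rng.shuffle(order)
    clustering = [0] * n
    for pos, idx in enumerate(order):
        clustering[idx] = pos % k

    # The max_iter cap prevents an infinite loop if assignments oscillate.
    for _ in range(max_iter):
        # Update step: compute the mean of each cluster.
        means = []
        for c in range(k):
            members = [data[i] for i in range(n) if clustering[i] == c]
            means.append([sum(col) / len(members) for col in zip(*members)])

        # Assignment step: move each item to the cluster with the closest mean.
        proposed = [min(range(k), key=lambda c: distance(row, means[c]))
                    for row in data]

        # Reject an update that would leave some cluster empty.
        if len(set(proposed)) < k:
            break
        # Stop when no item changes cluster.
        if proposed == clustering:
            break
        clustering = proposed

    return clustering

# Hypothetical (height, weight) items; only the first comes from the article.
raw = [(65.0, 220.0), (73.0, 160.0), (59.0, 110.0),
       (61.0, 120.0), (75.0, 150.0), (67.0, 240.0)]
print(kmeans(normalize(raw), k=3))
```

Because the starting assignment is random, different `seed` values can give different clusterings. A more careful implementation would probably try several starts and keep the best result.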