In the February 2013 issue of MSDN Magazine I wrote an article titled “Detecting Abnormal Data Using k-Means Clustering.” See http://msdn.microsoft.com/en-us/magazine/jj891054.aspx. There are really two concepts involved. The first concept is k-means clustering. Clustering is the process of placing data items into groups called clusters so that items within a particular cluster are similar, and items in different clusters are dissimilar. One problem here is defining exactly what similar and dissimilar mean. The second concept is abnormal data detection, where you look for data items that are in some sense different from the other items in a data set. So, the overall idea is to first cluster a data set, and then look for the one data item in each cluster that is most different from the other data items within that cluster.
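To make the second step concrete, here is a minimal Python sketch of finding the most atypical item in each cluster once the data has already been clustered. It is not the code from the MSDN article. The function names, the sample height and weight values, and the choice of plain Euclidean distance from the cluster mean as the measure of "most different" are all assumptions made for illustration.

```python
import math

def centroid(points):
    """Return the component-wise mean of a list of equal-length tuples."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))

def most_atypical(points, labels):
    """For each cluster label, return the member farthest from that cluster's mean."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    result = {}
    for lab, members in clusters.items():
        c = centroid(members)
        # Assumption: "most different" means largest Euclidean distance to the mean.
        result[lab] = max(members, key=lambda p: math.dist(p, c))
    return result

# Hypothetical (height in inches, weight in pounds) items, with cluster
# assignments assumed to come from an earlier clustering step.
points = [(65.0, 120.0), (66.0, 125.0), (64.0, 150.0),
          (73.0, 200.0), (74.0, 205.0), (72.0, 170.0)]
labels = [0, 0, 0, 1, 1, 1]

outliers = most_atypical(points, labels)
```

With raw values like these, weight dominates the distance because its numeric range is larger than that of height, so in practice you would probably normalize the data before clustering or measuring distances.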
The k-means algorithm dates back to the 1950s. It works only on purely numerical data (such as people’s heights, which can be, for example, 66.5 inches, 70.2 inches, and so on). Basic k-means does not work on categorical data (such as shirt color, which can be, for example, red, blue, and so on).
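The reason for the numeric-only restriction is that k-means needs to compute an average of the items in a cluster. A rough Python illustration, using a hypothetical helper function and made-up values, shows that an average is well defined for heights but has no meaning for color names:

```python
def mean(values):
    """Arithmetic mean; only meaningful for numeric values."""
    return sum(values) / len(values)

# Numeric data: the mean is a sensible cluster center.
heights = [66.5, 70.2, 68.0]
center = mean(heights)

# Categorical data: there is no "average" of color names, so the
# arithmetic fails with a TypeError.
colors = ["red", "blue", "red"]
try:
    mean(colors)
    categorical_ok = True
except TypeError:
    categorical_ok = False
```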
Although the k-means algorithm is conceptually simple, actually implementing it is somewhat tricky. In the MSDN article I demonstrate abnormal data detection using k-means clustering on a data set of people’s heights and weights. Technically, my article uses k-medoid rather than k-means clustering, because the representative point of each cluster during algorithm execution is one of the actual data points rather than an averaged, hypothetical data point.
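The difference between the two kinds of cluster center can be sketched in a few lines of Python. This is an illustration only, not the article's implementation. The function names and sample values are hypothetical, and the medoid here uses one common definition (the member with the smallest total distance to the other members), which is an assumption and may not match how the article selects its representative point.

```python
import math

def mean_point(points):
    """k-means style center: the component-wise average, which need not be a real item."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))

def medoid(points):
    """k-medoid style center: the actual item with the smallest total distance to all items."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Hypothetical (height in inches, weight in pounds) items in one cluster.
cluster = [(65.0, 120.0), (66.0, 125.0), (64.0, 150.0)]

avg = mean_point(cluster)  # an averaged point that is not one of the items
med = medoid(cluster)      # always one of the items in the cluster
```

Here the averaged center does not correspond to any real person in the data, while the medoid is guaranteed to be one of the original items.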