I wrote an article titled “Data Clustering using Entropy Minimization” in the February 2013 issue of Visual Studio Magazine. See http://visualstudiomagazine.com/Articles/2013/02/01/Data-Clustering-Using-Entropy-Minimization.aspx. The idea of clustering is quite simple — group data items together so that items in a cluster are similar to each other and different from items in other clusters. Although conceptually simple, actually implementing a data clustering algorithm is very difficult. Clustering has many important practical applications.
In the VSM article I describe a clustering technique that, as far as I can tell, has not been published before. It is based on the idea of entropy. The entropy of a set of data is a numerical measure of the amount of disorder in the data. An entropy value of 0 means no disorder; larger values indicate more disorder.
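The article's code is in C#, but the entropy calculation itself can be sketched in a few lines of Python. This is standard Shannon entropy with base-2 logarithms; the base used in the article may differ, but changing the base only scales the value and doesn't affect which clustering is best:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of categorical values.
    Returns 0.0 when all values are identical (no disorder)."""
    counts = Counter(values)
    n = len(values)
    # Sum of -p * log2(p) over each distinct value's frequency p.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(entropy(["red", "red", "red"]))           # 0.0 -- no disorder
print(entropy(["red", "blue", "red", "blue"]))  # 1.0 -- maximum for two values
```

A clustering with low total entropy has clusters whose items are mostly alike, which is exactly what the quality measure rewards.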
As I describe in the VSM article, any clustering algorithm must solve two problems. The first is coming up with a way to measure how good a particular clustering is; that's where entropy comes in. The second is searching through all possible clusterings to find the best one. This turns out to be hard because, except in trivial situations, the number of possible clusterings for a data set is astronomically large.
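To make "astronomically large" concrete: the number of ways to partition n items into exactly k non-empty clusters is given by the Stirling numbers of the second kind. The article doesn't show this formula; it's a standard combinatorial fact, sketched here via the usual recurrence:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: the number of ways to
    partition n items into exactly k non-empty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # Item n either joins one of k existing clusters, or starts its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(10, 3))    # 9330 clusterings for just 10 items, 3 clusters
print(stirling2(100, 20))  # a number with over 100 digits
```

Even a tiny data set of 10 items and 3 clusters has 9,330 possible clusterings, and the count grows explosively from there, which is why exhaustive search is off the table for all but toy problems.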