Data Clustering using Entropy Minimization

I wrote an article titled “Data Clustering using Entropy Minimization” in the February 2013 issue of Visual Studio Magazine. The idea of clustering is quite simple: group data items together so that items in the same cluster are similar to each other and different from items in other clusters. Although conceptually simple, actually implementing a data clustering algorithm is surprisingly difficult. Clustering has many important practical applications.

In the VSM article I describe a previously unpublished clustering technique (as far as I can tell, anyway) that is based on the idea of entropy. The entropy of a set of data is a numerical measure of the amount of disorder in the set. An entropy value of 0 means no disorder; larger values of entropy indicate more disorder.
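The article has the full details, but the basic entropy computation can be sketched briefly. This is just standard Shannon entropy over the frequencies of the values in a set, not necessarily the exact formulation used in the article:

```python
import math
from collections import Counter

def entropy(values):
    # Shannon entropy (base 2) of a sequence of discrete values.
    # 0.0 means all values are identical (no disorder); larger
    # values mean the data is more mixed up.
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(entropy([1, 1, 1, 1]))  # 0.0  (no disorder)
print(entropy([1, 2, 1, 2]))  # 1.0  (maximum disorder for two values)
```

A good clustering is one where each cluster, taken by itself, has low entropy, so the search is for the clustering that minimizes total entropy.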

As I describe in the VSM article, there are two problems to solve in any clustering algorithm. The first problem is to come up with a way to measure how good a particular clustering is. That’s where entropy comes in. The second problem is how to search through all possible clusterings to find the best clustering. As it turns out, this is a difficult problem because, except in trivial situations, the number of possible clusterings for a data set is astronomically large.
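Just how large the search space is can be made concrete. The number of ways to partition n items into exactly k non-empty clusters is the Stirling number of the second kind, S(n, k), which can be computed with a standard recurrence (this is an illustration of the size of the problem, not code from the article):

```python
def stirling2(n, k):
    # Number of ways to partition n items into exactly k non-empty
    # clusters (Stirling number of the second kind), computed via the
    # recurrence S(i, j) = j * S(i-1, j) + S(i-1, j-1).
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]

print(stirling2(10, 3))  # 9330 -- already too many to check by eye
# For a modest 100 items in 20 clusters, the count has over 100 digits:
print(len(str(stirling2(100, 20))))
```

So even for small data sets, brute-force examination of every possible clustering is out of the question, and some form of guided search is required.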
