The Category Utility Function as a Measure of Data Clustering

I ran across a very cool metric recently. The metric is called category utility, and it can be used to measure how well a database of categorical data is clustered into a set of groups. Cluster analysis of numeric data is one of the most widely studied areas in all of data mining. With numeric data it is not too hard to measure how far one row of data, such as (1.0, 2.0, 3.0), is from another row of data, such as (5.0, -1.5, 6.0). Using a difference metric you can then create clusters of data where the data within a cluster is similar, and the data in different clusters is dissimilar. However, it's not so easy to measure how far categorical data, such as (red, small, hot), is from other categorical data, such as (blue, large, cold). Category utility is a neat idea that measures such differences using probabilities. The equation is:
Each possible clustering of the data will have a different category utility value, with larger CU values indicating better clustering. I'm looking at a research paper for the International Symposium on Visual Computing, so I coded up an implementation of a category utility function in C#. The screenshot below shows my demo program in action.
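A minimal C# sketch of the computation, assuming the data rows are arrays of strings and the clustering is an array of cluster IDs (the toy data and method names here are made up for illustration, not the demo program from the screenshot), looks something like this:

using System;
using System.Collections.Generic;

static class CategoryUtilityDemo
{
  // data[r][c] is the value of attribute c in row r.
  // clustering[r] is the cluster ID (0..numClusters-1) assigned to row r.
  static double CategoryUtility(string[][] data, int[] clustering, int numClusters)
  {
    int n = data.Length;            // number of rows
    int numAttrs = data[0].Length;  // number of attributes (columns)

    // Cluster sizes, used for P(C_k).
    int[] clusterCounts = new int[numClusters];
    foreach (int k in clustering) clusterCounts[k]++;

    // Value counts per attribute overall, and per (cluster, attribute).
    var overall = new Dictionary<string, int>[numAttrs];
    var byCluster = new Dictionary<string, int>[numClusters, numAttrs];
    for (int c = 0; c < numAttrs; ++c)
    {
      overall[c] = new Dictionary<string, int>();
      for (int k = 0; k < numClusters; ++k)
        byCluster[k, c] = new Dictionary<string, int>();
    }
    for (int r = 0; r < n; ++r)
      for (int c = 0; c < numAttrs; ++c)
      {
        string v = data[r][c];
        overall[c][v] = overall[c].TryGetValue(v, out int oc) ? oc + 1 : 1;
        var d = byCluster[clustering[r], c];
        d[v] = d.TryGetValue(v, out int cc) ? cc + 1 : 1;
      }

    // Unconditional term: sum_i sum_j P(A_i = V_ij)^2.
    double uncond = 0.0;
    for (int c = 0; c < numAttrs; ++c)
      foreach (int count in overall[c].Values)
        uncond += ((double)count / n) * ((double)count / n);

    // Weighted sum over clusters of (conditional - unconditional) squared-probability mass.
    double sum = 0.0;
    for (int k = 0; k < numClusters; ++k)
    {
      double pCk = (double)clusterCounts[k] / n;
      double cond = 0.0;  // sum_i sum_j P(A_i = V_ij | C_k)^2
      for (int c = 0; c < numAttrs; ++c)
        foreach (int count in byCluster[k, c].Values)
          cond += ((double)count / clusterCounts[k]) * ((double)count / clusterCounts[k]);
      sum += pCk * (cond - uncond);
    }
    return sum / numClusters;  // average over the m clusters
  }

  static void Main()
  {
    // Toy data: (color, size, temperature) rows, split into two clusters.
    string[][] data = {
      new[] { "red",  "small", "hot"  },
      new[] { "red",  "small", "hot"  },
      new[] { "blue", "large", "cold" },
      new[] { "blue", "large", "cold" },
      new[] { "blue", "small", "cold" },
    };
    int[] clustering = { 0, 0, 1, 1, 1 };  // hot/red rows in cluster 0, cold/blue rows in cluster 1
    Console.WriteLine("CU = " + CategoryUtility(data, clustering, 2).ToString("F4"));
  }
}

Intuitively, the split in the toy example gets a relatively high CU because knowing a row's cluster pins down its color and temperature exactly; assigning rows to clusters at random would make the conditional distributions match the unconditional ones, driving the bracketed term, and hence CU, toward zero.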