Calculating Category Utility

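Category Utility (CU) is a clever measure of how good a clustering of categorical data is. The equation is:

$$
CU = \frac{1}{m} \sum_{k=0}^{m-1} P(C_k) \left[ \sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2 - \sum_i \sum_j P(A_i = V_{ij})^2 \right]
$$

Here m is the number of clusters, C_k is cluster k, A_i is the i-th attribute, and V_ij is the j-th possible value of attribute A_i.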

Basically, CU measures how much better you can guess an attribute value (like color = red) when you know which cluster an item belongs to, compared with guessing without any clustering. So, CU can be thought of as a measure of information gain, and larger values indicate a better clustering.

Here’s an example of how to calculate category utility. Suppose you have three attributes: color, size, and tax. Color can be red, blue, green, or yellow. Size can be small, medium, or large. Tax can be false or true. Let’s say you have five tuples and cluster them into two clusters, k = 0 and k = 1, like so:

---------------------
Red    Small    True
Red    Large    False
---------------------
Blue   Medium   True
Green  Medium   True
Green  Medium   False
---------------------

Step 1 – Calculate the probability of each cluster.

P(k = 0) = 2/5 = 0.40
P(k = 1) = 3/5 = 0.60

Step 2 – Calculate the unconditional expectation, which is the sum of squared probabilities of all attribute values over all five tuples, ignoring the clustering.

Red    (2/5)^2 = 0.16
Blue   (1/5)^2 = 0.04
Green  (2/5)^2 = 0.16
Yellow (0/5)^2 = 0.00
---
Small  (1/5)^2 = 0.04
Medium (3/5)^2 = 0.36
Large  (1/5)^2 = 0.04
---
False  (2/5)^2 = 0.16
True   (3/5)^2 = 0.36
----
           Sum = 1.32
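
As a rough check on Step 2, here is a minimal Python sketch of the same sum. The names `data` and `sum_squared_probs` are made up for this illustration, and the tuples are simply the five rows from the example. Values that never occur (like yellow) contribute zero, so they can be skipped.

```python
from collections import Counter

# The five tuples from the example, as (color, size, tax).
data = [
    ("Red", "Small", True),
    ("Red", "Large", False),
    ("Blue", "Medium", True),
    ("Green", "Medium", True),
    ("Green", "Medium", False),
]

def sum_squared_probs(rows):
    """Sum of squared value probabilities over every attribute column."""
    n = len(rows)
    total = 0.0
    for column in zip(*rows):        # one column per attribute
        counts = Counter(column)     # values that never occur add 0
        total += sum((c / n) ** 2 for c in counts.values())
    return total

unconditional = sum_squared_probs(data)
print(round(unconditional, 2))  # 1.32
```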

Step 3 – Calculate conditional expectations for each cluster.

A. For k = 0:

Red    (2/2)^2 = 1.00
Blue   (0/2)^2 = 0.00
Green  (0/2)^2 = 0.00
Yellow (0/2)^2 = 0.00
---
Small  (1/2)^2 = 0.25
Medium (0/2)^2 = 0.00
Large  (1/2)^2 = 0.25
---
False  (1/2)^2 = 0.25
True   (1/2)^2 = 0.25
----
           Sum = 2.00

B. For k = 1:

Red    (0/3)^2 = 0.00
Blue   (1/3)^2 = 0.11
Green  (2/3)^2 = 0.44
Yellow (0/3)^2 = 0.00
---
Small  (0/3)^2 = 0.00
Medium (3/3)^2 = 1.00
Large  (0/3)^2 = 0.00
---
False  (1/3)^2 = 0.11
True   (2/3)^2 = 0.44
----
           Sum = 2.11
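
The two conditional sums can be checked the same way, one cluster at a time. The sketch below uses Python's `fractions.Fraction` so the exact values are visible: the 2.11 for k = 1 is 19/9 rounded to two decimals. The `clusters` dictionary and the helper name are assumptions made for this illustration.

```python
from collections import Counter
from fractions import Fraction

# The example tuples grouped by cluster id, as (color, size, tax).
clusters = {
    0: [("Red", "Small", True), ("Red", "Large", False)],
    1: [
        ("Blue", "Medium", True),
        ("Green", "Medium", True),
        ("Green", "Medium", False),
    ],
}

def exact_sum_squared_probs(rows):
    """Exact sum of squared value probabilities within one group of rows."""
    n = len(rows)
    return sum(
        Fraction(count, n) ** 2
        for column in zip(*rows)                 # one column per attribute
        for count in Counter(column).values()    # absent values add 0
    )

conditional = {k: exact_sum_squared_probs(rows) for k, rows in clusters.items()}
for k, value in conditional.items():
    print(k, value, round(float(value), 2))
# 0 2 2.0
# 1 19/9 2.11
```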

Step 4 – Put it all together.

CU = [ (0.40 * (2.00 - 1.32)) + (0.60 * (2.11 - 1.32)) ] / 2
   = 0.3733

The divisor 2 is the number of clusters, and it applies to the whole weighted sum.
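
Putting the four steps together, one possible sketch of a full routine looks like this. It is only an illustration under simple assumptions: rows are equal-length tuples of categorical values, `labels` gives a cluster id for each row, and the input is non-empty. The function and variable names are invented for this example.

```python
from collections import Counter

def category_utility(rows, labels):
    """Category utility of a clustering; labels[i] is the cluster of rows[i]."""
    n = len(rows)

    # Group the rows by cluster id.
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)

    def sum_squared_probs(subset):
        # Sum of squared value probabilities over every attribute column.
        size = len(subset)
        return sum(
            (count / size) ** 2
            for column in zip(*subset)
            for count in Counter(column).values()
        )

    unconditional = sum_squared_probs(rows)              # Step 2
    weighted_gain = sum(
        (len(members) / n)                               # Step 1: P(k)
        * (sum_squared_probs(members) - unconditional)   # Step 3 minus Step 2
        for members in clusters.values()
    )
    return weighted_gain / len(clusters)                 # Step 4

rows = [
    ("Red", "Small", True),
    ("Red", "Large", False),
    ("Blue", "Medium", True),
    ("Green", "Medium", True),
    ("Green", "Medium", False),
]
labels = [0, 0, 1, 1, 1]
print(round(category_utility(rows, labels), 4))  # 0.3733
```

Putting every tuple into a single cluster gives a CU of 0, since the conditional and unconditional sums are then identical.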

Coding up a routine to compute category utility is surprisingly tricky. See image below for a demo example.
