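Category Utility (CU) is a clever measure of how good a clustering of categorical data is. The equation is:

CU = (1/m) * sum over k [ P(k) * ( sum over i,j P(A_i = V_ij | k)^2 - sum over i,j P(A_i = V_ij)^2 ) ]

Here m is the number of clusters, P(k) is the probability of cluster k, A_i is the i-th attribute, and V_ij is the j-th value of attribute A_i.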

Basically, CU compares how well you can guess an attribute value (like color = red) when you know which cluster a tuple belongs to with how well you can guess it without any clustering. So, CU can be thought of as a measure of information gain: larger values mean the clustering tells you more about the attribute values.

Here’s an example of how to calculate category utility. Suppose you have three attributes: color, size, and tax. Color can be red, blue, green, or yellow. Size can be small, medium, or large. Tax can be false or true. Let’s say you have five tuples and cluster them into two clusters, k = 0 (the first group below) and k = 1 (the second group), like so:

---------------------
Red Small True
Red Large False
---------------------
Blue Medium True
Green Medium True
Green Medium False
---------------------

Step 1 – Calculate the probability of each cluster.

P(k = 0) = 2/5 = 0.40
P(k = 1) = 3/5 = 0.60
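
As a quick check of Step 1, here is a minimal Python sketch. It assumes the clustering is stored as a list of cluster ids, one per tuple; the names `labels` and `cluster_probs` are made up for illustration.

```python
from collections import Counter

# Cluster id for each of the five tuples, in the order they are listed above.
labels = [0, 0, 1, 1, 1]

counts = Counter(labels)
# P(k) = (number of tuples in cluster k) / (total number of tuples)
cluster_probs = {k: counts[k] / len(labels) for k in sorted(counts)}

print(cluster_probs)  # {0: 0.4, 1: 0.6}
```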

Step 2 – Calculate the unconditional expectation: the sum of squared probabilities of all attribute values, computed over all five tuples and ignoring the clustering.

Red (2/5)^2 = 0.16
Blue (1/5)^2 = 0.04
Green (2/5)^2 = 0.16
Yellow (0/5)^2 = 0.00
---
Small (1/5)^2 = 0.04
Medium (3/5)^2 = 0.36
Large (1/5)^2 = 0.04
---
False (2/5)^2 = 0.16
True (3/5)^2 = 0.36
----
Sum = 1.32
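
The Step 2 sum can be reproduced with a few lines of Python. This is only a sketch: it assumes each tuple is stored as a Python tuple of (color, size, tax), and the helper name `sum_squared_probs` is hypothetical.

```python
from collections import Counter

# The five tuples from the example: (color, size, tax).
rows = [
    ("Red", "Small", True),
    ("Red", "Large", False),
    ("Blue", "Medium", True),
    ("Green", "Medium", True),
    ("Green", "Medium", False),
]

def sum_squared_probs(rows):
    """Sum of squared value probabilities, taken over every attribute column."""
    n = len(rows)
    total = 0.0
    for column in zip(*rows):      # one column per attribute
        counts = Counter(column)   # values that never occur (e.g. Yellow) add 0
        total += sum((c / n) ** 2 for c in counts.values())
    return total

unconditional = sum_squared_probs(rows)
print(round(unconditional, 2))  # 1.32
```

Values that never appear in the data, such as Yellow, contribute zero, so they do not need to be listed explicitly.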

Step 3 – Calculate conditional expectations for each cluster: the same sum of squared probabilities, but using only the tuples in that cluster.

A. For k = 0:

Red (2/2)^2 = 1.00
Blue (0/2)^2 = 0.00
Green (0/2)^2 = 0.00
Yellow (0/2)^2 = 0.00
---
Small (1/2)^2 = 0.25
Medium (0/2)^2 = 0.00
Large (1/2)^2 = 0.25
---
False (1/2)^2 = 0.25
True (1/2)^2 = 0.25
----
Sum = 2.00

B. For k = 1:

Red (0/3)^2 = 0.00
Blue (1/3)^2 = 0.11
Green (2/3)^2 = 0.44
Yellow (0/3)^2 = 0.00
---
Small (0/3)^2 = 0.00
Medium (3/3)^2 = 1.00
Large (0/3)^2 = 0.00
---
False (1/3)^2 = 0.11
True (2/3)^2 = 0.44
----
Sum = 2.11
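
The two conditional sums can be checked the same way, by applying the sum of squares to each cluster separately. In this sketch the clusters are written out as two explicit lists, and `sum_squared_probs` is again a hypothetical helper name.

```python
from collections import Counter

# The two clusters from the example, as lists of (color, size, tax) tuples.
clusters = [
    [("Red", "Small", True), ("Red", "Large", False)],        # k = 0
    [("Blue", "Medium", True), ("Green", "Medium", True),
     ("Green", "Medium", False)],                             # k = 1
]

def sum_squared_probs(rows):
    """Sum of squared value probabilities within the given rows only."""
    n = len(rows)
    return sum(
        (count / n) ** 2
        for column in zip(*rows)                  # one column per attribute
        for count in Counter(column).values()
    )

conditional = [sum_squared_probs(cluster) for cluster in clusters]
print([round(s, 2) for s in conditional])  # [2.0, 2.11]
```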

Step 4 – Put it all together.

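CU = ((0.40 * (2.00 - 1.32)) + (0.60 * (2.11 - 1.32))) / 2
= (0.2720 + 0.4747) / 2
= 0.3733

The division by 2 is the division by the number of clusters. The result uses the unrounded value 2.1111 for the k = 1 sum.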

Coding up a routine to compute category utility is surprisingly tricky. See image below for a demo example.
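
For readers who want something to run, here is a minimal Python sketch of the four steps above. It is not the demo program itself. The function name `category_utility` and its `rows`/`labels` arguments are assumptions made for illustration, and the number of clusters is taken to be the number of distinct labels, so empty clusters are not counted.

```python
from collections import Counter, defaultdict

def category_utility(rows, labels):
    """Category utility of a clustering of categorical tuples.

    rows   -- list of equal-length tuples of categorical values
    labels -- cluster id for each row, same length as rows
    """
    if not rows or len(rows) != len(labels):
        raise ValueError("rows and labels must be non-empty and the same length")

    def sum_squared_probs(subset):
        # Sum of squared value probabilities over every attribute column.
        n = len(subset)
        return sum(
            (count / n) ** 2
            for column in zip(*subset)
            for count in Counter(column).values()
        )

    # Group the rows by cluster id.
    clusters = defaultdict(list)
    for row, label in zip(rows, labels):
        clusters[label].append(row)

    unconditional = sum_squared_probs(rows)                    # Step 2
    weighted_gain = sum(
        (len(members) / len(rows))                             # Step 1: P(k)
        * (sum_squared_probs(members) - unconditional)         # Step 3
        for members in clusters.values()
    )
    return weighted_gain / len(clusters)                       # Step 4

rows = [
    ("Red", "Small", True),
    ("Red", "Large", False),
    ("Blue", "Medium", True),
    ("Green", "Medium", True),
    ("Green", "Medium", False),
]
labels = [0, 0, 1, 1, 1]

cu = category_utility(rows, labels)
print(round(cu, 4))  # 0.3733
```

Putting every tuple in a single cluster gives CU = 0, because the conditional and unconditional sums are then identical.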
