I ran into an interesting problem recently. I wanted to calculate the entropy of the data in a table. You can think of entropy as the amount of disorder in a system. So, in most situations, lower entropy is better than higher entropy, assuming you want a system that has some sort of structure.
There are several different equations for entropy. The most commonly used form is called Shannon’s entropy. The equation is:
H = -Sum[ P(xi) * log2( P(xi)) ]
Here H is entropy, the xi are the distinct data tokens, P() is probability, and log2 is the logarithm to base 2. For example, suppose you have some data about colors like this: (red, red, blue, green, green, green). The tokens x0, x1, and x2 are red, blue, and green. The first probability is P(x0) = P(red) = 2/6 = 0.3333. Similarly, P(x1) = P(blue) = 1/6 = 0.1667, and P(x2) = P(green) = 3/6 = 0.5000. Putting the pieces together:
H = - [ (0.3333) * log2(0.3333) + (0.1667) * log2(0.1667) + (0.5000) * log2(0.5000) ]
  = - [ (0.3333) * (-1.5850) + (0.1667) * (-2.5850) + (0.5000) * (-1.0000) ]
  = - [ (-0.5283) + (-0.4308) + (-0.5000) ]
  = - [ -1.4591 ]
  = 1.4591
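This arithmetic is easy to sketch in Python. The entropy function name below is my own choice, not from any particular library; it just computes the Shannon formula from the frequency counts:

```python
import math
from collections import Counter

def entropy(tokens):
    """Shannon entropy: H = -sum( P(x) * log2(P(x)) ) over distinct tokens."""
    counts = Counter(tokens)  # frequency of each distinct token
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

data = ["red", "red", "blue", "green", "green", "green"]
print(round(entropy(data), 4))  # 1.4591
```

Because the probabilities come from a Counter, only the frequency of each token matters, which is why the order of the data has no effect on the result.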
If you look at the calculations, you’ll notice two things. First, the order of the data tokens does not matter. Second, if all the tokens are identical you get the minimum entropy of 0 (because log2 of 1.0 is 0.0). OK, so that’s well-known and somewhat interesting. But now suppose you have a table of data as below, and you want to compute the entropy of the entire table:
Color   Height   Sex
----------------------
Red     Short    Male
Red     Tall     Male
Blue    Medium   Female
Green   Medium   Female
Green   Tall     Female
Green   Short    Male
It’s not so obvious how to compute the entropy of the table as a whole. I gave it a lot of thought and searched the Internet, but didn’t come up with a definitive result. After some time I decided the best approach might be to compute the entropy of each column and then simply sum those entropies. This makes the implicit assumption that all the columns are independent of each other, just as Naive Bayes clustering does. A slight variation would be to take the average of the column entropies instead of their sum.
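The sum-of-column-entropies idea can be sketched as follows. This is one possible scheme rather than a standard definition, and the entropy helper is my own naming:

```python
import math
from collections import Counter

def entropy(tokens):
    """Shannon entropy: H = -sum( P(x) * log2(P(x)) ) over distinct tokens."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# The demo table: one row per data item, one tuple slot per column.
table = [
    ("Red",   "Short",  "Male"),
    ("Red",   "Tall",   "Male"),
    ("Blue",  "Medium", "Female"),
    ("Green", "Medium", "Female"),
    ("Green", "Tall",   "Female"),
    ("Green", "Short",  "Male"),
]

# Transpose rows into columns, treat each column independently,
# then sum (or average) the per-column entropies.
columns = list(zip(*table))
col_entropies = [entropy(col) for col in columns]
total = sum(col_entropies)
average = total / len(col_entropies)
print([round(h, 4) for h in col_entropies])  # Color, Height, Sex
print(round(total, 4), round(average, 4))
```

For this table the column entropies are about 1.4591 (Color), 1.5850 (Height, three equally likely values), and 1.0000 (Sex, a 50/50 split), giving a sum of about 4.0441 and an average of about 1.3480.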
I still need to think about this some more.