## Calculating the Entropy of Data in a Table or Matrix

I ran into an interesting problem recently. I wanted to calculate the entropy of the data in a table. You can think of entropy as the amount of disorder in a system. So, in most situations, lower entropy is better than higher entropy, assuming you want a system that has some sort of structure.

There are several different equations for entropy. The most commonly used form is called Shannon’s entropy. The equation is:

```
H = -Sum[ P(xi) * log2( P(xi) ) ]
```

Here H is entropy, the xi are the distinct data tokens, P() is probability, and log2 is the base-2 logarithm. For example, suppose you have some data about colors like this: (red, red, blue, green, green, green). The distinct tokens x0, x1, and x2 are red, blue, and green. The first probability is P(x0) = P(red) = 2/6 = 0.3333. Similarly, P(x1) = P(blue) = 1/6 = 0.1667, and P(x2) = P(green) = 3/6 = 0.5000. Putting the pieces together:

```
H = - [ (0.3333) * log2(0.3333) +
        (0.1667) * log2(0.1667) +
        (0.5000) * log2(0.5000) ]

  = - [ (0.3333) * -1.5850 +
        (0.1667) * -2.5850 +
        (0.5000) * -1.0000 ]

  = - [ (-0.5283) + (-0.4308) + (-0.5000) ]

  = - [ -1.4591 ]

  = 1.4591
```

If you look at the calculations, you'll notice two things. First, the order of the data tokens does not matter. Second, if all the tokens are the same, you get the minimum possible entropy of 0 (because log2 of 1.0 is 0.0). OK, so that's well-known and somewhat interesting. But now suppose you have a table of data as below, and you want to compute the entropy of the entire table:

```
Color  Height  Sex
----------------------
Red    Short   Male
Red    Tall    Male
Blue   Medium  Female
Green  Medium  Female
Green  Tall    Female
Green  Short   Male
```

It’s not so obvious how to compute the entropy of the table as a whole. I gave it a lot of thought and searched the Internet, but didn’t come up with a definitive result. After some time I decided the best approach may be to compute the entropy of each column and then simply sum those entropies. This makes an implicit assumption that the columns are independent of each other, just as the Naive Bayes classification technique does. A slight variation would be to take the average of the column entropies instead of the sum.
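The column-wise approach can be sketched in a few lines of Python. This is just an illustration of the sum-and-average idea described above, with a helper function shannon_entropy of my own naming:

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy, in bits, of a sequence of data tokens."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

rows = [
    ("Red",   "Short",  "Male"),
    ("Red",   "Tall",   "Male"),
    ("Blue",  "Medium", "Female"),
    ("Green", "Medium", "Female"),
    ("Green", "Tall",   "Female"),
    ("Green", "Short",  "Male"),
]

columns = list(zip(*rows))  # transpose rows into columns
col_entropies = [shannon_entropy(col) for col in columns]

total = sum(col_entropies)                # table entropy as a sum
avg = total / len(col_entropies)          # or as an average

print(col_entropies)  # Color ~1.4591, Height ~1.5850, Sex = 1.0
print(total)          # ~4.0441
```

Here the Color column contributes 1.4591 bits (the same data as the earlier example), Height contributes log2(3) = 1.5850 bits since its three values are equally likely, and Sex contributes exactly 1.0 bit.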