Calculating Gini Impurity Example

The Gini Impurity (GI) metric measures how mixed, or impure, a set of items is. GI can be used as the splitting criterion in a decision tree machine learning classifier. The lowest possible value of GI is 0.0, which means the set is completely pure (homogeneous). The maximum value of GI depends on the number of classes in the particular problem being investigated, but approaches 1.0 as the number of classes grows.

Suppose you have 12 items — apples, bananas, cherries. If there are 0 apples, 0 bananas, 12 cherries, then you have minimal impurity (this is good for decision trees) and GI = 0.0. But if you have 4 apples, 4 bananas, 4 cherries, you have maximum impurity and it turns out that GI = 0.667.

Instead of showing the math equation (you can find it on Wikipedia), I’ll show example calculations. Maximum GI:

         apples  bananas  cherries
count =  4       4        4
p     =  4/12    4/12     4/12
      =  1/3     1/3      1/3

GI = 1 - [ (1/3)^2 + (1/3)^2 + (1/3)^2 ]
   = 1 - [ 1/9 + 1/9 + 1/9 ]
   = 1 - 1/3
   = 2/3
   = 0.667

When the items are evenly distributed across the classes, as in the example above, you have maximum GI, but the exact value depends on how many classes there are. A bit less than maximum GI:

         apples  bananas  cherries
count =  3       3        6
p     =  3/12    3/12     6/12
      =  1/4     1/4      1/2

GI = 1 - [ (1/4)^2 + (1/4)^2 + (1/2)^2 ]
   = 1 - [ 1/16 + 1/16 + 1/4 ]
   = 1 - 6/16
   = 10/16
   = 0.625

In the example above, the items are not quite evenly distributed, and the GI is slightly less (which is better when used for decision trees). Minimum GI:

         apples  bananas  cherries
count =  0       12        0
p     =  0/12    12/12     0/12
      =  0       1         0

GI = 1 - [ 0^2 + 1^2 + 0^2 ]
   = 1 - [ 0 + 1 + 0 ]
   = 1 - 1
   = 0.00

In the example above, the items are as unevenly distributed as possible, and the GI is the smallest possible value of 0.0 (which is the best possible situation when used for decision trees).
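The three worked examples above can be verified with a few lines of code. Here is a minimal sketch; the function name gini is my own choice, not part of any particular library:

```python
# Gini impurity from a list of class counts:
# GI = 1 - sum(p_i^2), where p_i = count_i / total.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(round(gini([4, 4, 4]), 3))   # maximum impurity for 3 classes: 0.667
print(round(gini([3, 3, 6]), 3))   # slightly less: 0.625
print(round(gini([0, 12, 0]), 3))  # completely pure set: 0.0
```

The same function works for any number of classes, which is why the maximum value changes with the problem: for n evenly distributed classes, GI = 1 - 1/n.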

The Gini impurity metric is not an acronym; it’s named after the Italian statistician Corrado Gini. Gini impurity is not at all the same as a different metric called the Gini coefficient, which is used in economics to measure income inequality (and is also named after him). The Gini impurity metric can be used when creating a decision tree, but there are alternatives, including entropy-based information gain. The advantage of GI is its simplicity.



“Purity” by Italian artist Pino Daeni and “Purity” by Chinese artist Jia Liu. I can create sophisticated software systems but I could never create art like these paintings.
