Suppose you have this data:
1.0, 2.0, 3.0, 0
4.0, 5.0, 6.0, 0
7.0, 8.0, 9.0, 1
10.0, 11.0, 12.0, 1
13.0, 14.0, 15.0, 1
16.0, 17.0, 18.0, 2
19.0, 20.0, 21.0, 3
22.0, 23.0, 24.0, 3
Each row represents a person. There are four classes of people, indicated by the last value in each row, and three predictor variables. The data is artificial, but you can imagine that the four classes represent job type (engineering, sales, management, operations) and the three predictor variables are sick-days, personal-days-off, and vacation-days. The goal is to predict job type from the predictor values.
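To make the later examples concrete, here is one way the dummy data could be represented in code. This is just a sketch; the names data, features, and labels are my own and aren't part of any particular library.

# the eight-person dummy dataset: three predictors plus a class label
data = [
    [1.0,  2.0,  3.0,  0],
    [4.0,  5.0,  6.0,  0],
    [7.0,  8.0,  9.0,  1],
    [10.0, 11.0, 12.0, 1],
    [13.0, 14.0, 15.0, 1],
    [16.0, 17.0, 18.0, 2],
    [19.0, 20.0, 21.0, 3],
    [22.0, 23.0, 24.0, 3],
]

features = [row[:3] for row in data]  # sick-days, personal-days-off, vacation-days
labels = [row[3] for row in data]     # job type: 0, 1, 2, 3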
There are many machine learning techniques you can use to create a prediction model, including numeric naive Bayes, k-NN, neural network classifiers, etc. One of the most basic techniques is to use a decision tree. The final form of a decision tree will be a set of rules like, “if sick < 15.0 and vacation ≥ 12.0 then job-type = 2”.
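To make that idea concrete, here is a minimal sketch of what a single learned rule might look like in code. The thresholds are just the hypothetical values from the example rule above, not values learned from the dummy data, and the function name is my own.

def predict_job_type(sick, personal, vacation):
    # one hypothetical rule: "if sick < 15.0 and vacation >= 12.0 then job-type = 2"
    if sick < 15.0 and vacation >= 12.0:
        return 2
    return -1  # no rule matched; a full tree always reaches some leaf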
Creating a decision tree classifier is not too difficult conceptually, but the implementation details are very tricky. (This is the opposite of neural networks, which are conceptually extremely deep but not very difficult to implement.)
One of the key ideas when creating a decision tree is repeatedly splitting the data into two groups so that each group contains items that are mostly of the same class.
The two most common approaches when splitting data for a decision tree are using Shannon entropy and using Gini impurity. Both are measures of disorder in a set of items. I usually prefer to use Gini impurity but sometimes entropy works slightly better. Suppose you have four classes, 0 to 3, and a set of eight items: (0, 0, 1, 1, 1, 1, 1, 3). The Shannon entropy of a set of items is defined as -1 * Sum[p * log2(p)] where p is the probability of each class. So P(0) = 2/8 = 0.25, P(1) = 5/8 = 0.625, P(2) = 0/8 = 0.00, P(3) = 1/8 = 0.125. The sum of each probability times the log to the base 2 of that probability is:
sum = 0.25 * log2(0.25) +
0.625 * log2(0.625) +
0.00 * log2(0.00) +
0.125 * log2(0.125)
= 0.25 * -2.0000 +
0.625 * -0.6781 +
0.00 * (na) +
0.125 * -3.0000
= -0.5000 + -0.4238 + 0.0000 + -0.3750 = -1.2988
entropy = -1 * sum = 1.2988
When computing entropy in code you have to avoid trying to compute log2(0), because the log, to any base, of zero is negative infinity. By convention, a 0 * log2(0) term is simply treated as 0.
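Here is a minimal Python sketch of the entropy computation, assuming class labels are consecutive integers starting at 0. The function name is my own. Classes with a count of zero are skipped so log2(0) is never evaluated.

import math

def shannon_entropy(labels, num_classes):
    # count how many items belong to each class
    counts = [0] * num_classes
    for lbl in labels:
        counts[lbl] += 1

    n = len(labels)
    ent = 0.0
    for c in counts:
        if c == 0:
            continue  # a class that never appears contributes 0; also avoids log2(0)
        p = c / n     # probability of this class
        ent -= p * math.log2(p)
    return ent

print(shannon_entropy([0, 0, 1, 1, 1, 1, 1, 3], 4))  # approx 1.2988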
Lower values of entropy mean the data items in a set are mostly the same. Higher values of entropy indicate more disorder (the data items aren’t the same). In the extreme, the entropy for a set of items that are all identical is 0.00 — for decision trees lower entropy is better. The largest possible value of Shannon entropy is log2 of the number of classes. For example, if you had 10 items and they were all different, Shannon entropy is log2(10) = 3.3219.
When creating a decision tree you want to split a set of items into two subsets so that the entropy of the class values is low. You could just try different splits, compute the entropies of the items in each of the two partitions, then take the average. This is OK but has a minor downside: partitions of different sizes are weighted the same. Therefore, it's usual to weight the two entropy values by the number of items in each partition.
For the eight-row dummy dataset at the top of this post, the class labels are (0, 0, 1, 1, 1, 2, 3, 3). Suppose you decide to split the eight items into the first three items (0, 0, 1) and the last five items (1, 1, 2, 3, 3). The entropy of the first set is 0.9183. The entropy of the second set is 1.5219. The weighted average is (3/8) * 0.9183 + (5/8) * 1.5219 = 1.2956.
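Here is a minimal sketch of the weighted-average idea, using a compact Counter-based version of the entropy computation. The function names are my own. The printed values reproduce the split example above.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_split_entropy(left, right):
    # average the two entropies, weighted by the size of each partition
    n = len(left) + len(right)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

# class labels of the eight-row dummy dataset, split after the third row
left = [0, 0, 1]
right = [1, 1, 2, 3, 3]

print(entropy(left))                        # approx 0.9183
print(entropy(right))                       # approx 1.5219
print(weighted_split_entropy(left, right))  # approx 1.2956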
Understanding entropy and disorder is the first step in gaining the knowledge you need to implement a decision tree classifier. Next, you need to understand how to search through your data to find a good split. You can’t try all possible splits because of the combinatorial explosion problem, so you have to use a different approach. I’ll explain in a future post.
I wonder if people tend to favor one hand over the other when making the OK sign. I always use my left hand for an OK.