Basic decision trees have several weaknesses and so there are many enhanced tree models. These include, in order of increasing complexity, bootstrap aggregation (“bagging”), random forest, adaptive boosting (“AdaBoost”), and gradient boosting. There are many variations of each of the four enhanced tree models.
I put together a demo of the scikit random forest module.
In very high-level pseudo-code, scikit default random forest is:
loop N times
  fetch a random subset of training data
  create a basic decision tree from subset
end-loop
model = majority vote of the N trees
Each tree is trained on a random subset of the training items, selected by bootstrap sampling (drawing rows with replacement). In addition, when a tree node is split, only a random subset of the predictor variables is considered. For example, if there are 6 predictor variables, each split might examine just 3 of them. The scikit default is the square root of the number of predictors (truncated to an integer).
The idea is that each tree sees a slightly different version of the training data, and additional diversity is introduced by looking at different sets of predictors. These two sources of randomness reduce model overfitting, which is the major weakness of tree classifiers.
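To make the two randomization ideas concrete, here is a simplified from-scratch sketch that uses the scikit DecisionTreeClassifier as the base learner. The simple_forest function and its parameters are my invention, and the real scikit RandomForestClassifier averages the trees' predicted class probabilities rather than taking a hard majority vote, but the structure is the same:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_forest(train_x, train_y, test_x, n_trees=10, seed=1):
  # train n_trees trees, each on a bootstrap sample of the rows,
  # with a random subset of predictors considered at every split
  rnd = np.random.RandomState(seed)
  n = len(train_x)
  all_preds = []
  for _ in range(n_trees):
    rows = rnd.choice(n, size=n, replace=True)  # rows, with replacement
    tree = DecisionTreeClassifier(max_features="sqrt",
      random_state=rnd)  # sqrt(num predictors) per split
    tree.fit(train_x[rows], train_y[rows])
    all_preds.append(tree.predict(test_x))
  all_preds = np.array(all_preds)  # shape (n_trees, n_test)
  # hard majority vote down each column (one column per test item)
  return np.apply_along_axis(
    lambda votes: np.bincount(votes).argmax(), 0, all_preds)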
For my demo, I used one of my standard multi-class classification problems. The data looks like:
1  0.24  1 0 0  0.2950  2
0  0.39  0 0 1  0.5120  1
1  0.63  0 1 0  0.7580  0
0  0.36  1 0 0  0.4450  1
. . .
Each line of data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (0 = conservative, 1 = moderate, 2 = liberal). The goal is to predict politics type from sex, age, state, income. There are 200 training items and 40 test items.
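For context, here is how one raw record maps to the encoded form. The encode_person helper below is hypothetical (the training and test files are already encoded), but it shows the scheme:

def encode_person(sex, age, state, income):
  # sex: male = 0, female = 1; age divided by 100;
  # state one-hot encoded; income divided by 100,000
  state_map = {"michigan": [1,0,0], "nebraska": [0,1,0],
    "oklahoma": [0,0,1]}
  return [0 if sex == "male" else 1, age / 100.0] + \
    state_map[state] + [income / 100_000.0]

print(encode_person("male", 35, "nebraska", 55_000))
# [0, 0.35, 0, 1, 0, 0.55]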
The signature of the random forest module constructor is complex:
# RandomForestClassifier(n_estimators='warn',
#   criterion='gini', max_depth=None, min_samples_split=2,
#   min_samples_leaf=1, min_weight_fraction_leaf=0.0,
#   max_features='auto', max_leaf_nodes=None,
#   min_impurity_decrease=0.0, min_impurity_split=None,
#   bootstrap=True, oob_score=False, n_jobs=None,
#   random_state=None, verbose=0, warm_start=False,
#   class_weight=None)
It would take at least a couple of pages to explain all these parameters, but the two most important are n_estimators (the number of trees) and max_features (the number of randomly selected predictors considered at each tree split). The random_state parameter is also important for getting reproducible results. For my demo I tried:
print("Creating RandomForestClassifier model ")
model = RandomForestClassifier(n_estimators=10,
  max_features=3, random_state=1)
model.fit(train_x, train_y)
print("Done ")
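As a side note, a fitted forest holds its individual trees in the estimators_ attribute, which is handy for quick sanity checks. The depths and leaf counts printed depend entirely on the data:

print(len(model.estimators_))  # 10 DecisionTreeClassifier objects
for t in model.estimators_[0:3]:
  print(t.get_depth(), t.get_n_leaves())  # size of first three trees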
The results weren't very good. As is typical with tree-based classifiers, prediction accuracy on the training data was excellent, but the model was overfitted and had poor accuracy on the test data.
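One standard response to the overfitting is to constrain the individual trees and search over the key parameters. Here is a sketch using scikit GridSearchCV; the grid values are illustrative guesses on my part, not tuned settings:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [10, 50, 100],
  "max_features": [2, 3],
  "max_depth": [4, 8, None],
  "min_samples_leaf": [1, 5]}
gs = GridSearchCV(RandomForestClassifier(random_state=1),
  param_grid, cv=5)  # 5-fold cross-validation over the grid
gs.fit(train_x, train_y)
print(gs.best_params_)
print("Best CV accuracy = %0.4f " % gs.best_score_)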
My machine learning colleagues tend to fall into one of two buckets: those who use mostly tree techniques and those who use mostly neural techniques. I tend to use neural techniques, but I'll often look at a tree model too, to see if the models agree.
Good fun.
One of my favorite movie genres is fantasy. Many fantasy films feature memorable forest scenes. Here are three forest scenes randomly selected from my memory. Left: In “The Fellowship of the Ring” (2001), the Hobbits are pursued by the Dark Riders in the forest. Very scary! Center: In “Labyrinth” (1986), Sarah is searching for her baby brother who was stolen by Jareth, the Goblin King. Not scary. Right: In “The Brothers Grimm” (2005), Wilhelm and Jacob must go through a very evil forest with very evil trees to get to the castle of the very evil queen. Very scary.
Demo code below. The training and test data can be found at https://jamesmccaffrey.wordpress.com/2023/02/13/multi-class-classification-using-a-scikit-decision-tree/
# people_politics_forest.py
# predict politics (0 = con, 1 = mod, 2 = lib)
# from sex, age, state, income.
# uses random forest

# sex  age    state    income   politics
#  0   0.27   0 1 0    0.7610     2
#  1   0.19   0 0 1    0.6550     0
# sex: 0 = male, 1 = female
# state: michigan = 100, nebraska = 010, oklahoma = 001
# politics: conservative, moderate, liberal

# Anaconda3-2022.10  Python 3.9.13  scikit 1.0.2
# Windows 10/11

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# ---------------------------------------------------------

def show_confusion(cm):
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual   ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted    ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")

# ---------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin scikit random forest example ")
  print("Predict politics from sex, age, State, income ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)

  # sex  age    state    income   politics
  #  0   0.27   0 1 0    0.7610     2
  #  1   0.19   0 0 1    0.6550     0

  # 1. load data
  print("\nLoading data into memory ")
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,7),
    delimiter="\t", comments="#", dtype=np.float32)
  train_x = train_xy[:,0:6]
  train_y = train_xy[:,6].astype(int)

  test_file = ".\\Data\\people_test.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,7),
    delimiter="\t", comments="#", dtype=np.float32)
  test_x = test_xy[:,0:6]
  test_y = test_xy[:,6].astype(int)

  print("\nTraining data:")
  print(train_x[0:4])
  print(". . . \n")
  print(train_y[0:4])
  print(". . . ")

# ---------------------------------------------------------

  # 2. create and train
  # RandomForestClassifier(n_estimators='warn',
  #   criterion='gini', max_depth=None, min_samples_split=2,
  #   min_samples_leaf=1, min_weight_fraction_leaf=0.0,
  #   max_features='auto', max_leaf_nodes=None,
  #   min_impurity_decrease=0.0, min_impurity_split=None,
  #   bootstrap=True, oob_score=False, n_jobs=None,
  #   random_state=None, verbose=0, warm_start=False,
  #   class_weight=None)
  print("\nCreating RandomForestClassifier model ")
  model = RandomForestClassifier(n_estimators=10,
    max_features=3, random_state=1)
  model.fit(train_x, train_y)
  print("Done ")

  # 3. evaluate
  acc_train = model.score(train_x, train_y)
  print("\nAccuracy on train = %0.4f " % acc_train)
  acc_test = model.score(test_x, test_y)
  print("Accuracy on test = %0.4f " % acc_test)

  # 3b. display formatted confusion matrix
  from sklearn.metrics import confusion_matrix
  y_predicteds = model.predict(test_x)
  cm = confusion_matrix(test_y, y_predicteds)
  print("\nConfusion matrix: \n")
  show_confusion(cm)

  # 4. use model
  print("\nPredict for: M 35 Nebraska $55K ")
  X = np.array([[0, 0.35, 0,1,0, 0.5500]],
    dtype=np.float32)

  probs = model.predict_proba(X)
  print("\nPrediction pseudo-probs: ")
  print(probs)

  politic = model.predict(X)
  print("\nPredicted class: ")
  print(politic)

  # 5. TODO: save model using pickle

  print("\nEnd scikit random forest demo ")

if __name__ == "__main__":
  main()
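The demo leaves step 5, saving the trained model, as a TODO. Here is a minimal sketch using the standard pickle module; the Models directory and file name are my assumptions:

import pickle

# save the trained model to disk
path = ".\\Models\\politics_forest.pkl"
with open(path, "wb") as f:
  pickle.dump(model, f)

# later, reload the model and use it to predict
with open(path, "rb") as f:
  model2 = pickle.load(f)
X = np.array([[0, 0.35, 0,1,0, 0.5500]], dtype=np.float32)
print(model2.predict(X))  # predicted politics class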