Basic decision trees have several weaknesses and so there are many enhanced tree models. These include, in order of increasing complexity, bootstrap aggregation (“bagging”), random forest, adaptive boosting (“AdaBoost”), and gradient boosting. There are many variations of each of the four enhanced tree models.
Note: Gradient boosting is an advanced form of AdaBoost, and XGBoost ("extreme gradient boosting") is an advanced form of gradient boosting. The XGBoost algorithm is not directly implemented in the scikit library.
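For what it's worth, scikit does have its own gradient boosting classifier; to use XGBoost itself you need the separate xgboost package. A minimal sketch (not part of my demo):

# scikit's built-in gradient boosting classifier
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100, random_state=1)

# XGBoost comes from the separate xgboost package ("pip install xgboost")
# from xgboost import XGBClassifier
# xgc = XGBClassifier(n_estimators=100, random_state=1)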
I put together a demo of the scikit AdaBoost module.
In very high-level pseudo-code, the AdaBoost algorithm looks like:
create a primitive decision stump tree
loop 50 times
  create a new weighted decision stump
  add new stump to ensemble
end-loop
model = majority vote of the 50 trees
The pseudo-code omits many important details. Here’s another version of pseudo-code that has more details. It assumes a binary classification scenario, where the two classes are coded as -1 and +1.
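Roughly, with w[i] the weight of training item i, y[i] the true class (-1 or +1), and h_t the decision stump fitted at round t:

initialize each item weight w[i] = 1 / n
loop t = 1 to 50 times
  fit a decision stump h_t to the training data using the weights w
  compute the weighted error rate err_t of h_t
  compute stump importance alpha_t = 0.5 * ln((1 - err_t) / err_t)
  for each item i: w[i] = w[i] * exp(-alpha_t * y[i] * h_t(x[i]))
  normalize the w[i] so they sum to 1
end-loop
final prediction = sign( sum over t of alpha_t * h_t(x) )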
There are several variations of AdaBoost. They’re all fairly complex but the Wikipedia article on AdaBoost is pretty good (unlike many Wikipedia machine learning articles).
For the demo, I used one of my standard multi-class classification problems. The data looks like:
1   0.24   1 0 0   0.2950   2
0   0.39   0 0 1   0.5120   1
1   0.63   0 1 0   0.7580   0
0   0.36   1 0 0   0.4450   1
. . .
Each line of data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (0 = conservative, 1 = moderate, 2 = liberal). The goal is to predict politics type from sex, age, state, income. There are 200 training items and 40 test items.
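For example (my own illustration, not a line from the data files), a 35-year-old male from Nebraska who makes $55,000 per year would be encoded as:

import numpy as np
x = np.array([[0,         # sex: male
               0.35,      # age 35, divided by 100
               0, 1, 0,   # state: Nebraska
               0.5500]],  # income $55,000, divided by 100,000
             dtype=np.float32)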
The signature of the AdaBoost module constructor is deceptively simple:
# AdaBoostClassifier(base_estimator=None, *, n_estimators=50,
#   learning_rate=1.0, algorithm='SAMME.R',
#   random_state=None)
The actual parameter complexity comes from the internal DecisionTreeClassifier which is used by default as the estimator:
# DecisionTreeClassifier(*, criterion='gini',
#   splitter='best', max_depth=None, min_samples_split=2,
#   min_samples_leaf=1, min_weight_fraction_leaf=0.0,
#   max_features=None, random_state=None,
#   max_leaf_nodes=None, min_impurity_decrease=0.0,
#   class_weight=None, ccp_alpha=0.0)
For my demo I created an AdaBoost classifier using the default parameter values, explicitly passing a max_depth=1 decision tree (a decision stump) as the base estimator, and supplying a random_state value so that results are reproducible:
print("Creating AdaBoost model using default params ") from sklearn.tree import DecisionTreeClassifier classifier = DecisionTreeClassifier(max_depth=1) model = AdaBoostClassifier(base_estimator=classifier, n_estimators=50, learning_rate=1.0, random_state=1) model.fit(train_x, train_y) print("Done ")
The results weren’t very good. As usual with tree-based classifiers, prediction accuracy on the training data was very good, but the model was overfitted and had poor accuracy on the test data.
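One standard way to attack the overfitting (not something I did for this demo) is to search over the AdaBoost hyperparameters. Here's a rough sketch using scikit's GridSearchCV, with arbitrary candidate values and the demo's train_x and train_y:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

params = {
  "n_estimators": [25, 50, 100],
  "learning_rate": [0.1, 0.5, 1.0],
  "base_estimator__max_depth": [1, 2, 3]  # depth of each tree
}
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(),
  random_state=1)
search = GridSearchCV(ada, params, cv=5)  # 5-fold cross-validation
search.fit(train_x, train_y)
print(search.best_params_)
print("best CV accuracy = %0.4f " % search.best_score_)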
Variations of decision tree classifiers are seductive in the sense that they’re very simple and easy to understand. But neural network classifiers have enabled the fantastic breakthroughs in artificial intelligence and machine learning. Even so, tree-based classifiers can still be useful in many real-world scenarios.
Good fun.
Decision tree models are seductive to newcomers to machine learning, but tree models often don't turn out well. Female alien seduction in science fiction movies usually doesn't turn out well for the seductee either.
Left: In “Lifeforce” (1985) the crew of a space shuttle discovers a huge alien spaceship with the bodies of two men and a woman. Sure, let’s bring them to Earth. Unfortunately, all three are space vampires, including the one female known as Space Girl. This is a pretty good movie.
Center: In “Queen of Blood” (1966), a crew from Earth goes to Mars and discovers a crashed alien spaceship. There’s a female alien inside. Sure, let’s bring her back to Earth. Unfortunately she is a space vampire who can seduce with glowing eyes. Not a bad movie if you’re a fan of old sci-fi B quality movies like I am.
Right: In “Species” (1995) scientists receive information from aliens about how to splice their DNA with human DNA. Sure, let’s try that on Earth. Unfortunately, the result is a super alien woman named Sil who wants to reproduce. The consequences of mating with Sil are not pleasant for her male victims. This is a surprisingly good movie.
Demo code below. The training and test data can be found at https://jamesmccaffrey.wordpress.com/2023/02/13/multi-class-classification-using-a-scikit-decision-tree.
# people_politics_adaboost.py
# predict politics (0 = con, 1 = mod, 2 = lib)
# from sex, age, state, income.
# uses AdaBoost ("adaptive boosting") algorithm

# sex  age    state    income   politics
#  0   0.27   0 1 0    0.7610      2
#  1   0.19   0 0 1    0.6550      0
# sex: 0 = male, 1 = female
# state: michigan = 100, nebraska = 010, oklahoma = 001
# politics: conservative, moderate, liberal

# Anaconda3-2022.10  Python 3.9.13  scikit 1.0.2
# Windows 10/11

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# ---------------------------------------------------------

def show_confusion(cm):
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")

# ---------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin scikit AdaBoost example ")
  print("Predict politics from sex, age, State, income ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)

  # sex age state income politics
  # 0 0.27 0 1 0 0.7610 2
  # 1 0.19 0 0 1 0.6550 0

  # 1. load data
  print("\nLoading data into memory ")
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,7),
    delimiter="\t", comments="#", dtype=np.float32)
  train_x = train_xy[:,0:6]
  train_y = train_xy[:,6].astype(int)

  test_file = ".\\Data\\people_test.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,7),
    delimiter="\t", comments="#", dtype=np.float32)
  test_x = test_xy[:,0:6]
  test_y = test_xy[:,6].astype(int)

  print("\nTraining data:")
  print(train_x[0:4])
  print(". . . \n")
  print(train_y[0:4])
  print(". . . ")

  # ---------------------------------------------------------

  # 2. create and train
  # AdaBoostClassifier(estimator=None, *, n_estimators=50,
  #   learning_rate=1.0, algorithm='SAMME.R',
  #   random_state=None, base_estimator='deprecated')
  # DecisionTreeClassifier(*, criterion='gini',
  #   splitter='best', max_depth=None, min_samples_split=2,
  #   min_samples_leaf=1, min_weight_fraction_leaf=0.0,
  #   max_features=None, random_state=None,
  #   max_leaf_nodes=None, min_impurity_decrease=0.0,
  #   class_weight=None, ccp_alpha=0.0)
  print("\nCreating AdaBoost model using default params ")
  from sklearn.tree import DecisionTreeClassifier
  classifier = DecisionTreeClassifier(max_depth=1)
  model = AdaBoostClassifier(base_estimator=classifier,
    n_estimators=50, learning_rate=1.0, random_state=1)
  model.fit(train_x, train_y)
  print("Done ")

  # 3. evaluate
  acc_train = model.score(train_x, train_y)
  print("\nAccuracy on train = %0.4f " % acc_train)
  acc_test = model.score(test_x, test_y)
  print("Accuracy on test = %0.4f " % acc_test)

  # 3b. display formatted confusion matrix
  from sklearn.metrics import confusion_matrix
  y_predicteds = model.predict(test_x)
  cm = confusion_matrix(test_y, y_predicteds)
  print("\nConfusion matrix: \n")
  show_confusion(cm)

  # 4. use model
  print("\nPredict for: M 35 Nebraska $55K ")
  X = np.array([[0, 0.35, 0,1,0, 0.5500]], dtype=np.float32)
  probs = model.predict_proba(X)
  print("\nPrediction pseudo-probs: ")
  print(probs)

  politic = model.predict(X)
  print("\nPredicted class: ")
  print(politic)

  # 5. TODO: save model using pickle

  print("\nEnd scikit AdaBoost demo ")

if __name__ == "__main__":
  main()