Basic decision trees have several weaknesses and so there are many enhanced tree models. These include, in order of increasing complexity, bootstrap aggregation (“bagging”), random forest, adaptive boosting (“AdaBoost”), and gradient boosting. There are many variations of each of the four enhanced tree models.
In high-level pseudo-code, scikit default bagging is:
loop 10 times
  fetch a random subset of training data
  create a basic decision tree from subset
end-loop
model = majority vote of the 10 trees
By default, each random subset of the N training data items is selected by picking N items with replacement. Because the sampling is with replacement, each subset typically contains duplicates of some items and omits others, so every tree sees a slightly different view of the training data. That variation among the trees reduces model overfitting.
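To make the sampling idea concrete, here is a minimal sketch of one bootstrap round using NumPy. The array names and the tiny dataset size are my own illustration; scikit does this selection internally when bootstrap=True.

# bootstrap_sketch.py
# one bootstrap sample, for illustration only -- scikit
# BaggingClassifier does this internally when bootstrap=True
import numpy as np

rng = np.random.default_rng(seed=1)
N = 8  # pretend there are 8 training items with IDs 0..7

# pick N indices WITH replacement -- some IDs repeat, some are left out
idxs = rng.choice(N, size=N, replace=True)
print("bootstrap indices: ", idxs)   # e.g. [3 7 7 0 2 4 4 1]
print("left-out (OOB) IDs:", np.setdiff1d(np.arange(N), idxs))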
I put together a demo. I used one of my standard multi-class classification problems. The data looks like:
1   0.24   1 0 0   0.2950   2
0   0.39   0 0 1   0.5120   1
1   0.63   0 1 0   0.7580   0
0   0.36   1 0 0   0.4450   1
. . .
Each line of data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (0 = conservative, 1 = moderate, 2 = liberal). The goal is to predict politics type from sex, age, state, income. There are 200 training items and 40 test items.
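As a side note, here is a minimal sketch of how a raw record could be encoded into this format. The encode_person() helper and the raw field values are hypothetical examples of mine; they are not part of the demo program.

# encoding sketch (hypothetical helper, not part of the demo)
def encode_person(sex, age, state, income):
  sx = 0 if sex == "M" else 1                 # male = 0, female = 1
  ag = age / 100.0                            # age divided by 100
  st = {"michigan": [1,0,0], "nebraska": [0,1,0],
        "oklahoma": [0,0,1]}[state]           # one-hot state
  inc = income / 100_000.0                    # income divided by 100,000
  return [sx, ag] + st + [inc]

print(encode_person("F", 24, "michigan", 29_500))
# [1, 0.24, 1, 0, 0, 0.295]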
The signature of the bagging constructor is:
# BaggingClassifier(estimator=None, n_estimators=10, *,
#  max_samples=1.0, max_features=1.0, bootstrap=True,
#  bootstrap_features=False, oob_score=False,
#  warm_start=False, n_jobs=None, random_state=None,
#  verbose=0, base_estimator='deprecated')
The estimator=None default means to use the basic scikit DecisionTreeClassifier with all of its default parameters (no maximum depth, Gini impurity split criterion, and so on). The max_samples=1.0 means use a random selection of 100% of the training data for each of the 10 trees, and bootstrap=True means the selection is made with replacement.
For my demo, I used all the default values except that I supplied a random_state seed value so that results are reproducible.
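For clarity, here is roughly what that call is equivalent to with the main default values written out explicitly. Be aware that in older versions of scikit (before 1.2) the first parameter is named base_estimator rather than estimator, so the exact keyword depends on your installed version.

# roughly equivalent to BaggingClassifier(random_state=1)
# ('estimator' keyword is for scikit 1.2+; older versions
#  use 'base_estimator' instead)
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

model = BaggingClassifier(
  estimator=DecisionTreeClassifier(),  # default base learner
  n_estimators=10,      # 10 trees
  max_samples=1.0,      # each tree gets N = 100% of the items . . .
  bootstrap=True,       # . . . sampled with replacement
  random_state=1)       # reproducibility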
There’s no moral to the story. Just an interesting experiment with bagging.
Three examples of fashion made from brown paper bags, with varying degrees of sophistication.
Demo code. The data can be found at https://jamesmccaffrey.wordpress.com/2023/02/13/multi-class-classification-using-a-scikit-decision-tree/.
# people_politics_bagging.py

# predict politics (0 = con, 1 = mod, 2 = lib)
# from sex, age, state, income.
# uses "bootstrap aggregating" ("bagging")

# sex  age    state    income   politics
#  0   0.27   0 1 0    0.7610     2
#  1   0.19   0 0 1    0.6550     0
# sex: 0 = male, 1 = female
# state: michigan = 100, nebraska = 010, oklahoma = 001
# politics: conservative, moderate, liberal

# Anaconda3-2022.10  Python 3.9.13  scikit 1.0.2
# Windows 10/11

import numpy as np
from sklearn.ensemble import BaggingClassifier

# ---------------------------------------------------------

def show_confusion(cm):
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")

# ---------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin scikit bootstrap aggregation example ")
  print("Predict politics from sex, age, State, income ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)

  # sex  age    state    income   politics
  #  0   0.27   0 1 0    0.7610     2
  #  1   0.19   0 0 1    0.6550     0

  # 1. load data
  print("\nLoading data into memory ")
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,7),
    delimiter="\t", comments="#", dtype=np.float32)
  train_x = train_xy[:,0:6]
  train_y = train_xy[:,6].astype(int)

  test_file = ".\\Data\\people_test.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,7),
    delimiter="\t", comments="#", dtype=np.float32)
  test_x = test_xy[:,0:6]
  test_y = test_xy[:,6].astype(int)

  print("\nTraining data:")
  print(train_x[0:4])
  print(". . . \n")
  print(train_y[0:4])
  print(". . . ")

# ---------------------------------------------------------

  # 2. create and train
  # BaggingClassifier(estimator=None, n_estimators=10, *,
  #  max_samples=1.0, max_features=1.0, bootstrap=True,
  #  bootstrap_features=False, oob_score=False,
  #  warm_start=False, n_jobs=None, random_state=None,
  #  verbose=0, base_estimator='deprecated')

  print("\nCreating bagging DecisionTreeClassifier model ")
  model = BaggingClassifier(random_state=1)
  model.fit(train_x, train_y)
  print("Done ")

  # 3. evaluate
  acc_train = model.score(train_x, train_y)
  print("\nAccuracy on train = %0.4f " % acc_train)
  acc_test = model.score(test_x, test_y)
  print("Accuracy on test = %0.4f " % acc_test)

  # 3b. display formatted confusion matrix
  from sklearn.metrics import confusion_matrix
  y_predicteds = model.predict(test_x)
  cm = confusion_matrix(test_y, y_predicteds)
  print("\nConfusion matrix: \n")
  show_confusion(cm)

  # 4. use model
  print("\nPredict for: M 35 Nebraska $55K ")
  X = np.array([[0, 0.35, 0,1,0, 0.5500]],
    dtype=np.float32)
  probs = model.predict_proba(X)
  print("\nPrediction pseudo-probs: ")
  print(probs)

  politic = model.predict(X)
  print("\nPredicted class: ")
  print(politic)

  # 6. TODO: save model using pickle
  # import pickle
  # print("Saving trained tree model ")
  # path = ".\\Models\\tree_bagging_model.sav"
  # pickle.dump(model, open(path, "wb"))

  # use saved model
  # X = np.array([[0, 0.35, 0,1,0, 0.5500]],
  #   dtype=np.float32)
  # with open(path, 'rb') as f:
  #   loaded_model = pickle.load(f)
  # pa = loaded_model.predict_proba(X)
  # print(pa)

  print("\nEnd scikit bagging tree demo ")

if __name__ == "__main__":
  main()
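If you want to peek inside the fitted ensemble, a trained BaggingClassifier exposes the individual trees and the bootstrap row indices each tree was fit on. The fragment below is a sketch of mine that assumes it runs after the model.fit() call in the demo above.

# sketch: inspect the fitted ensemble (run after model.fit() above)
# model.estimators_ holds the 10 fitted DecisionTreeClassifier objects
# model.estimators_samples_ holds the bootstrap row indices per tree
for i, (tree, idxs) in enumerate(zip(model.estimators_,
  model.estimators_samples_)):
  n_unique = len(np.unique(idxs))  # training items actually seen
  print("tree %2d: depth = %2d  unique training items = %3d" % \
    (i, tree.get_depth(), n_unique))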