An Example of Bootstrap Aggregation (Bagging) Classification Using the scikit Library

Basic decision trees have several weaknesses and so there are many enhanced tree models. These include, in order of increasing complexity, bootstrap aggregation (“bagging”), random forest, adaptive boosting (“AdaBoost”), and gradient boosting. There are many variations of each of the four enhanced tree models.

In high-level pseudo-code, scikit default bagging is:

loop 10 times
  fetch a random subset of training data
  create a basic decision tree from subset
end-loop
model = majority vote of the 10 trees
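The pseudo-code translates almost directly into Python. Here is a minimal from-scratch sketch of the bagging idea (for illustration only; this is not how scikit implements BaggingClassifier internally):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(train_x, train_y, n_trees=10, seed=1):
  # train n_trees basic trees, each on a bootstrap sample
  rnd = np.random.RandomState(seed)
  n = len(train_x)
  trees = []
  for _ in range(n_trees):
    rows = rnd.choice(n, size=n, replace=True)  # with replacement
    t = DecisionTreeClassifier()
    t.fit(train_x[rows], train_y[rows])
    trees.append(t)
  return trees

def bagging_predict(trees, x):
  # majority vote; assumes integer class labels 0, 1, 2, ...
  votes = np.array([t.predict(x) for t in trees])
  return np.array([np.bincount(votes[:,i]).argmax()
    for i in range(votes.shape[1])])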

By default, each random subset of the N training data items is selected by picking N items with replacement. This means that some training items are picked more than once and some items aren't picked at all, so each of the 10 trees sees a slightly different version of the training data. This variation reduces model overfitting.
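The probability that a particular item is never picked in N draws with replacement is (1 - 1/N)^N, which approaches 1/e = 0.37 as N gets large, so each bootstrap sample contains only about 63% of the distinct training items on average. A quick sanity check using NumPy (not part of the demo):

import numpy as np
rnd = np.random.RandomState(1)
rows = rnd.choice(200, size=200, replace=True)  # one bootstrap sample
print(len(np.unique(rows)))  # roughly 126 distinct items out of 200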

I put together a demo. I used one of my standard multi-class classification problems. The data looks like:

 1   0.24   1   0   0   0.2950   2
 0   0.39   0   0   1   0.5120   1
 1   0.63   0   1   0   0.7580   0
 0   0.36   1   0   0   0.4450   1
. . . 

Each line of data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (one-hot encoded as Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (0 = conservative, 1 = moderate, 2 = liberal). The goal is to predict politics type from sex, age, state, and income. There are 200 training items and 40 test items.
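For example, a hypothetical 35-year-old male from Nebraska who makes $55,000 per year and is a political moderate would be encoded as:

 0   0.35   0   1   0   0.5500   1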

The signature of the bagging constructor is:

  # BaggingClassifier(estimator=None, n_estimators=10, *,
  #  max_samples=1.0, max_features=1.0, bootstrap=True,
  #  bootstrap_features=False, oob_score=False,
  #  warm_start=False, n_jobs=None, random_state=None,
  #  verbose=0, base_estimator='deprecated')

The estimator=None means to use the basic scikit DecisionTreeClassifier with all of its default parameters (no max depth, Gini impurity split criterion, and so on). The max_samples=1.0 means that each of the 10 trees gets a random selection of 100% of the training data. The bootstrap=True means the selection is done with replacement.
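If you don't want the default behavior, you can pass an explicit base estimator and override the parameters. A sketch, where the parameter values are arbitrary choices for illustration (note that in scikit versions before 1.2 the first parameter is named base_estimator rather than estimator):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 50 shallow trees, each trained on 80% of the training data
model = BaggingClassifier(
  estimator=DecisionTreeClassifier(max_depth=4),
  n_estimators=50,
  max_samples=0.80,
  oob_score=True,  # compute accuracy on out-of-bag items
  random_state=1)
# after calling fit(), model.oob_score_ holds the OOB accuracy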

For my demo, I used all the default values except that I supplied a random_state seed value so that results are reproducible.

There’s no moral to the story. Just an interesting experiment with bagging.



Three examples of fashion made from brown paper bags, with varying degrees of sophistication.


Demo code. The data can be found at https://jamesmccaffrey.wordpress.com/2023/02/13/multi-class-classification-using-a-scikit-decision-tree/.

# people_politics_bagging.py

# predict politics (0 = con, 1 = mod, 2 = lib) 
# from sex, age, state, income.
# uses "bootstrap aggregating" ("bagging")

# sex  age    state    income   politics
#  0   0.27   0  1  0   0.7610   2
#  1   0.19   0  0  1   0.6550   0
# sex: 0 = male, 1 = female
# state: michigan = 100, nebraska = 010, oklahoma = 001
# politics: conservative, moderate, liberal

# Anaconda3-2022.10  Python 3.9.13  scikit 1.0.2
# Windows 10/11

import numpy as np 
from sklearn.ensemble import BaggingClassifier  

# ---------------------------------------------------------

def show_confusion(cm):
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual   ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted    ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")

# ---------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin scikit bootstrap aggregation example ")
  print("Predict politics from sex, age, State, income ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)

  # sex  age    state    income   politics
  #  0   0.27   0  1  0   0.7610   2
  #  1   0.19   0  0  1   0.6550   0

  # 1. load data
  print("\nLoading data into memory ")
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,7),
    delimiter="\t", comments="#",  dtype=np.float32) 
  train_x = train_xy[:,0:6]
  train_y = train_xy[:,6].astype(int)

  test_file = ".\\Data\\people_test.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,7),
    delimiter="\t", comments="#",  dtype=np.float32) 
  test_x = test_xy[:,0:6]
  test_y = test_xy[:,6].astype(int)

  print("\nTraining data:")
  print(train_x[0:4])
  print(". . . \n")
  print(train_y[0:4])
  print(". . . ")

# ---------------------------------------------------------

  # 2. create and train 
  # BaggingClassifier(estimator=None, n_estimators=10, *,
  #  max_samples=1.0, max_features=1.0, bootstrap=True,
  #  bootstrap_features=False, oob_score=False,
  #  warm_start=False, n_jobs=None, random_state=None,
  #  verbose=0, base_estimator='deprecated')

  print("\nCreating bagging DecisionTreeClassifier model ")
  model = BaggingClassifier(random_state=1)
  model.fit(train_x, train_y)
  print("Done ")

  # 3. evaluate
  acc_train = model.score(train_x, train_y)
  print("\nAccuracy on train = %0.4f " % acc_train)
  acc_test = model.score(test_x, test_y)
  print("Accuracy on test = %0.4f " % acc_test)

  # 3b. display formatted confusion matrix
  from sklearn.metrics import confusion_matrix
  y_predicteds = model.predict(test_x)
  cm = confusion_matrix(test_y, y_predicteds)
  print("\nConfusion matrix: \n")
  show_confusion(cm)

  # 4. use model
  print("\nPredict for: M 35 Nebraska $55K ")
  X = np.array([[0, 0.35, 0,1,0, 0.5500]],
    dtype=np.float32)
  probs = model.predict_proba(X)
  print("\nPrediction pseudo-probs: ")
  print(probs)

  politic = model.predict(X)
  print("\nPredicted class: ")
  print(politic)

  # 5. TODO: save model using pickle
  # import pickle
  # print("Saving trained tree model ")
  # path = ".\\Models\\tree_bagging_model.sav"
  # pickle.dump(model, open(path, "wb"))

  # use saved model
  # X = np.array([[0, 0.35, 0,1,0, 0.5500]],
  #   dtype=np.float32)
  # with open(path, 'rb') as f:
  #   loaded_model = pickle.load(f)
  # pa = loaded_model.predict_proba(X)
  # print(pa)

  print("\nEnd scikit bagging tree demo ")

if __name__ == "__main__":
  main()
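A side note: a trained BaggingClassifier keeps its component trees in the estimators_ attribute, so you can examine the 10 underlying trees individually. For example:

# inspect the component trees of the fitted model
for i, tree in enumerate(model.estimators_):
  print("tree %d has depth %d " % (i, tree.get_depth()))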