An Example of AdaBoost Classification Using the scikit Library

Basic decision trees have several weaknesses and so there are many enhanced tree models. These include, in order of increasing complexity, bootstrap aggregation (“bagging”), random forest, adaptive boosting (“AdaBoost”), and gradient boosting. There are many variations of each of the four enhanced tree models.
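
For reference, all four of these enhanced tree models have direct implementations in the scikit sklearn.ensemble module:

  # the four enhanced tree models in sklearn.ensemble
  from sklearn.ensemble import BaggingClassifier
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.ensemble import AdaBoostClassifier
  from sklearn.ensemble import GradientBoostingClassifier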

Note: Gradient boosting is an advanced form of AdaBoost, and XGBoost (“extreme gradient boosting”) is an advanced form of gradient boosting. The XGBoost algorithm is not directly implemented in the scikit library.
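
If you want to experiment with XGBoost, the separate xgboost package (which must be installed, for example with pip) exposes a scikit-compatible classifier. A minimal sketch:

  # pip install xgboost
  # from xgboost import XGBClassifier
  # model = XGBClassifier(n_estimators=100)
  # model.fit(train_x, train_y)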

I put together a demo of the scikit AdaBoost module.

In very high-level pseudo-code, the AdaBoost algorithm looks like:

create a primitive decision stump tree
loop 50 times
  create a new weighted decision stump
  add new stump to ensemble
end-loop
model = weighted majority vote of the 50 stumps

The pseudo-code omits many important details. Here’s another version of the pseudo-code, a sketch of the classic discrete AdaBoost algorithm, with more details. It assumes a binary classification scenario, where the two classes are coded as -1 and +1:
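
initialize all n training item weights w[i] = 1/n
loop t = 1 to 50 times
  fit a decision stump to the weighted training data
  e = weighted error rate of the stump
  alpha[t] = 0.5 * ln((1 - e) / e)
  for-each training item i
    if the stump misclassifies item i
      w[i] = w[i] * exp(alpha[t])
    else
      w[i] = w[i] * exp(-alpha[t])
  end-for
  normalize the w[i] so they sum to 1
end-loop
model(x) = sign(sum over t of alpha[t] * stump_t(x))

The key ideas are that misclassified training items get larger weights, so the next stump concentrates on them, and that stumps with low error get a bigger say (a larger alpha) in the final weighted vote.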

There are several variations of AdaBoost. They’re all fairly complex but the Wikipedia article on AdaBoost is pretty good (unlike many Wikipedia machine learning articles).

For my demo, I used one of my standard multi-class classification problems. The data looks like:

 1   0.24   1   0   0   0.2950   2
 0   0.39   0   0   1   0.5120   1
 1   0.63   0   1   0   0.7580   0
 0   0.36   1   0   0   0.4450   1
. . . 

Each line of data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (0 = conservative, 1 = moderate, 2 = liberal). The goal is to predict politics type from sex, age, state, and income. There are 200 training items and 40 test items.
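
For example, the first line of data can be decoded like so:

  #  1      = sex: female
  #  0.24   = age: 24, divided by 100
  #  1 0 0  = state: Michigan
  #  0.2950 = income: $29,500, divided by 100,000
  #  2      = politics: liberal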

The signature of the AdaBoost module constructor is deceptively simple:

  # AdaBoostClassifier(base_estimator=None, *, n_estimators=50,
  #  learning_rate=1.0, algorithm='SAMME.R',
  #  random_state=None)

The actual parameter complexity comes from the internal DecisionTreeClassifier, which is used as the base estimator by default:

  # DecisionTreeClassifier(*, criterion='gini',
  #  splitter='best', max_depth=None, min_samples_split=2,
  #  min_samples_leaf=1, min_weight_fraction_leaf=0.0,
  #  max_features=None, random_state=None,
  #  max_leaf_nodes=None, min_impurity_decrease=0.0, 
  #  class_weight=None, ccp_alpha=0.0)

For my demo I created an AdaBoost classifier, explicitly passing the default values for the base estimator (a decision stump with max_depth=1), n_estimators, and learning_rate, and supplying a random_state value so that results are reproducible:

  print("Creating AdaBoost model using default params ")
  from sklearn.tree import DecisionTreeClassifier
  classifier = DecisionTreeClassifier(max_depth=1)
  model = AdaBoostClassifier(base_estimator=classifier,
    n_estimators=50, learning_rate=1.0, random_state=1)
  model.fit(train_x, train_y)
  print("Done ")

The results weren’t very good. As is usual with most tree-based classifiers, prediction accuracy on the training data was very good, but the model was overfitted and had poor accuracy on the test data.
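
One possible way to attack the overfitting is to search over the number of estimators and the learning rate. Here’s a minimal sketch using the scikit GridSearchCV module; the grid values here are arbitrary:

  from sklearn.model_selection import GridSearchCV
  params = { 'n_estimators' : [25, 50, 100],
    'learning_rate' : [0.10, 0.50, 1.00] }
  gs = GridSearchCV(AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=1),
    random_state=1), params, cv=5)
  gs.fit(train_x, train_y)
  print(gs.best_params_)

There’s no guarantee that a grid search will fix the overfitting, but it’s a reasonable first thing to try.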

Variations of decision tree classifiers are seductive in the sense that they’re very simple and easy to understand. But neural network classifiers have enabled the fantastic breakthroughs in artificial intelligence and machine learning. Even so, tree-based classifiers can still be useful in many real-world scenarios.

Good fun.



Machine learning decision tree models are seductive to newcomers, but tree models often don’t turn out well. Female alien seduction in science fiction movies usually doesn’t turn out well for the seductee either.

Left: In “Lifeforce” (1985) the crew of a space shuttle discovers a huge alien spaceship with the bodies of two men and a woman. Sure, let’s bring them to Earth. Unfortunately, all three are space vampires, including the one female known as Space Girl. This is a pretty good movie.

Center: In “Queen of Blood” (1966), a crew from Earth goes to Mars and discovers a crashed alien spaceship. There’s a female alien inside. Sure, let’s bring her back to Earth. Unfortunately, she is a space vampire who can seduce with glowing eyes. Not a bad movie if you’re a fan of old B-quality sci-fi movies like I am.

Right: In “Species” (1995) scientists receive information from aliens about how to splice their DNA with human DNA. Sure, let’s try that on Earth. Unfortunately, the result is a super alien woman named Sil who wants to reproduce. The consequences of mating with Sil are not pleasant for her male victims. This is a surprisingly good movie.


Demo code below. The training and test data can be found at https://jamesmccaffrey.wordpress.com/2023/02/13/multi-class-classification-using-a-scikit-decision-tree.

# people_politics_adaboost.py

# predict politics (0 = con, 1 = mod, 2 = lib) 
# from sex, age, state, income.
# uses AdaBoost ("adaptive boosting") algorithm

# sex  age    state    income   politics
#  0   0.27   0  1  0   0.7610   2
#  1   0.19   0  0  1   0.6550   0
# sex: 0 = male, 1 = female
# state: michigan = 100, nebraska = 010, oklahoma = 001
# politics: conservative, moderate, liberal

# Anaconda3-2022.10  Python 3.9.13  scikit 1.0.2
# Windows 10/11

import numpy as np 
from sklearn.ensemble import AdaBoostClassifier  

# ---------------------------------------------------------

def show_confusion(cm):
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual   ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted    ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")

# ---------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin scikit AdaBoost example ")
  print("Predict politics from sex, age, State, income ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)

  # sex  age    state    income   politics
  #  0   0.27   0  1  0   0.7610   2
  #  1   0.19   0  0  1   0.6550   0

  # 1. load data
  print("\nLoading data into memory ")
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,7),
    delimiter="\t", comments="#",  dtype=np.float32) 
  train_x = train_xy[:,0:6]
  train_y = train_xy[:,6].astype(int)

  test_file = ".\\Data\\people_test.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,7),
    delimiter="\t", comments="#",  dtype=np.float32) 
  test_x = test_xy[:,0:6]
  test_y = test_xy[:,6].astype(int)

  print("\nTraining data:")
  print(train_x[0:4])
  print(". . . \n")
  print(train_y[0:4])
  print(". . . ")

# ---------------------------------------------------------

  # 2. create and train 
  # AdaBoostClassifier(base_estimator=None, *, n_estimators=50,
  #  learning_rate=1.0, algorithm='SAMME.R',
  #  random_state=None)

  # DecisionTreeClassifier(*, criterion='gini',
  #  splitter='best', max_depth=None, min_samples_split=2,
  #  min_samples_leaf=1, min_weight_fraction_leaf=0.0,
  #  max_features=None, random_state=None,
  #  max_leaf_nodes=None, min_impurity_decrease=0.0, 
  #  class_weight=None, ccp_alpha=0.0)

  print("\nCreating AdaBoost model using default params ")
  from sklearn.tree import DecisionTreeClassifier
  classifier = DecisionTreeClassifier(max_depth=1)
  model = AdaBoostClassifier(base_estimator=classifier,
    n_estimators=50, learning_rate=1.0, random_state=1)
  model.fit(train_x, train_y)
  print("Done ")

  # 3. evaluate
  acc_train = model.score(train_x, train_y)
  print("\nAccuracy on train = %0.4f " % acc_train)
  acc_test = model.score(test_x, test_y)
  print("Accuracy on test = %0.4f " % acc_test)

  # 3b. display formatted confusion matrix
  from sklearn.metrics import confusion_matrix
  y_predicteds = model.predict(test_x)
  cm = confusion_matrix(test_y, y_predicteds)
  print("\nConfusion matrix: \n")
  show_confusion(cm)

  # 4. use model
  print("\nPredict for: M 35 Nebraska $55K ")
  X = np.array([[0, 0.35, 0,1,0, 0.5500]],
    dtype=np.float32)
  probs = model.predict_proba(X)
  print("\nPrediction pseudo-probs: ")
  print(probs)

  politic = model.predict(X)
  print("\nPredicted class: ")
  print(politic)

  # 5. TODO: save model using pickle
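  # a minimal sketch, assuming a Models directory exists:
  # import pickle
  # print("Saving trained model ")
  # path = ".\\Models\\adaboost_model.pkl"
  # with open(path, "wb") as f:
  #   pickle.dump(model, f)
  # to load and use the saved model later:
  # with open(path, "rb") as f:
  #   model2 = pickle.load(f)
  # pa = model2.predict_proba(X)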

  print("\nEnd scikit AdaBoost demo ")

if __name__ == "__main__":
  main()