An Example of Random Forest Classification Using the scikit Library

Basic decision trees have several weaknesses, so there are many enhanced tree models. These include, in order of increasing complexity, bootstrap aggregation (“bagging”), random forest, adaptive boosting (“AdaBoost”), and gradient boosting. There are many variations of each of these four techniques.

I put together a demo of the scikit random forest module.

In very high-level pseudo-code, the scikit default random forest algorithm is:

loop N times
  draw a random bootstrap sample of the training data
  create a basic decision tree from the sample
end-loop
model = majority vote of the N trees

Each tree is trained on a bootstrap sample of the training items (rows sampled with replacement). In addition, at each split point, only a random subset of the predictor variables is considered. For example, if there are 6 predictor variables, each split might consider just 3 of the predictors. The scikit default is the square root of the number of predictors (truncated to an integer).

The idea is that each tree sees a different sample of the data, and robustness is introduced by looking at different subsets of the predictors. These two sources of randomness reduce model overfitting, which is the major weakness of tree classifiers.
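To make the idea concrete, here is a minimal from-scratch sketch of the bagging-plus-random-predictors technique, using scikit decision trees as the base learners. The synthetic data and variable names are mine, purely for illustration:

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  rng = np.random.default_rng(1)
  X = rng.random((20, 6)).astype(np.float32)  # 20 items, 6 predictors
  y = rng.integers(0, 3, 20)                  # 3 classes

  n_trees = 10
  trees = []
  for _ in range(n_trees):
    rows = rng.integers(0, len(X), len(X))  # bootstrap: sample rows with replacement
    tree = DecisionTreeClassifier(max_features=3)  # 3 random predictors per split
    tree.fit(X[rows], y[rows])
    trees.append(tree)

  x = X[0:1]  # predict for one item
  votes = np.array([int(t.predict(x)[0]) for t in trees])
  print("majority vote class:", np.bincount(votes).argmax())

The real scikit RandomForestClassifier does essentially this, plus a lot of bookkeeping (class probabilities are averaged rather than hard-voted, parallelism, and so on).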

For the demo, I used one of my standard multi-class classification problems. The data looks like:

 1   0.24   1   0   0   0.2950   2
 0   0.39   0   0   1   0.5120   1
 1   0.63   0   1   0   0.7580   0
 0   0.36   1   0   0   0.4450   1
. . . 

Each line of data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (0 = conservative, 1 = moderate, 2 = liberal). The goal is to predict politics type from sex, age, state, income. There are 200 training items and 40 test items.
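To be clear about the encoding, here is how one hypothetical raw record would be converted (this record is my invention, not from the dataset):

  # hypothetical raw record: male, 30 years old, Oklahoma,
  # $48,000 income, moderate politics
  sex = 0
  age = 30 / 100             # 0.30
  state = [0, 0, 1]          # oklahoma = 001
  income = 48000 / 100000    # 0.48
  politics = 1               # moderate
  print([sex, age] + state + [income, politics])
  # [0, 0.3, 0, 0, 1, 0.48, 1]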

The signature of the random forest module constructor is complex:

  # RandomForestClassifier(n_estimators=100,
  #  criterion='gini', max_depth=None, min_samples_split=2,
  #  min_samples_leaf=1, min_weight_fraction_leaf=0.0,
  #  max_features='auto', max_leaf_nodes=None,
  #  min_impurity_decrease=0.0, bootstrap=True,
  #  oob_score=False, n_jobs=None, random_state=None,
  #  verbose=0, warm_start=False, class_weight=None,
  #  ccp_alpha=0.0, max_samples=None)
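The exact defaults change across scikit versions, so it's a good idea to check the signature for your installed version:

  import inspect
  from sklearn.ensemble import RandomForestClassifier
  print(inspect.signature(RandomForestClassifier.__init__))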

It would take at least a couple of pages to explain all these parameters, but the two most important are n_estimators (the number of trees) and max_features (the number of randomly selected predictors to consider at each split). Also important is the random_state parameter, which makes results reproducible. For my demo I tried:

  print("Creating RandomForestClassifier model ")
  model = RandomForestClassifier(n_estimators=10, 
    max_features=3, random_state=1)

  model.fit(train_x, train_y)
  print("Done ")

The results weren’t very good: as is usual with tree-based classifiers, prediction accuracy on the training data was excellent, but the model overfitted and had poor accuracy on the test data.
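One standard way to push back on overfitting is to use more trees and cap tree depth. A sketch, using the demo's train_x and train_y (the specific values here are just guesses to illustrate the parameters, not tuned results):

  model = RandomForestClassifier(n_estimators=100,
    max_depth=4, max_features=3, random_state=1)
  model.fit(train_x, train_y)
  print("train acc = %0.4f" % model.score(train_x, train_y))
  print("test acc  = %0.4f" % model.score(test_x, test_y))

A shallower tree fits the training data less exactly, and averaging over more trees smooths out the noise each individual tree picks up.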

My machine learning colleagues tend to fall into one of two buckets: those who use mostly tree techniques and those who use mostly neural techniques. I tend to use neural techniques, but I’ll often look at a tree model too, to see if the two approaches agree.

Good fun.



One of my favorite movie genres is fantasy. Many fantasy films feature memorable forest scenes. Here are three forest scenes randomly selected from my memory. Left: In “The Fellowship of the Ring” (2001), the Hobbits are pursued by the Dark Riders in the forest. Very scary! Center: In “Labyrinth” (1986), Sarah is searching for her baby brother who was stolen by Jareth, the Goblin King. Not scary. Right: In “The Brothers Grimm” (2005), Wilhelm and Jacob must go through a very evil forest with very evil trees to get to the castle of the very evil queen. Very scary.


Demo code below. The training and test data can be found at https://jamesmccaffrey.wordpress.com/2023/02/13/multi-class-classification-using-a-scikit-decision-tree/

# people_politics_forest.py

# predict politics (0 = con, 1 = mod, 2 = lib) 
# from sex, age, state, income.
# uses random forest

# sex  age    state    income   politics
#  0   0.27   0  1  0   0.7610   2
#  1   0.19   0  0  1   0.6550   0
# sex: 0 = male, 1 = female
# state: michigan = 100, nebraska = 010, oklahoma = 001
# politics: conservative, moderate, liberal

# Anaconda3-2022.10  Python 3.9.13  scikit 1.0.2
# Windows 10/11

import numpy as np 
from sklearn.ensemble import RandomForestClassifier  

# ---------------------------------------------------------

def show_confusion(cm):
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual   ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted    ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")

# ---------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin scikit random forest example ")
  print("Predict politics from sex, age, State, income ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)

  # sex  age    state    income   politics
  #  0   0.27   0  1  0   0.7610   2
  #  1   0.19   0  0  1   0.6550   0

  # 1. load data
  print("\nLoading data into memory ")
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,7),
    delimiter="\t", comments="#",  dtype=np.float32) 
  train_x = train_xy[:,0:6]
  train_y = train_xy[:,6].astype(int)

  test_file = ".\\Data\\people_test.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,7),
    delimiter="\t", comments="#",  dtype=np.float32) 
  test_x = test_xy[:,0:6]
  test_y = test_xy[:,6].astype(int)

  print("\nTraining data:")
  print(train_x[0:4])
  print(". . . \n")
  print(train_y[0:4])
  print(". . . ")

# ---------------------------------------------------------

  # 2. create and train 
  # RandomForestClassifier(n_estimators=100,
  #  criterion='gini', max_depth=None, min_samples_split=2,
  #  min_samples_leaf=1, min_weight_fraction_leaf=0.0,
  #  max_features='auto', max_leaf_nodes=None,
  #  min_impurity_decrease=0.0, bootstrap=True,
  #  oob_score=False, n_jobs=None, random_state=None,
  #  verbose=0, warm_start=False, class_weight=None,
  #  ccp_alpha=0.0, max_samples=None)

  print("\nCreating RandomForestClassifier model ")
  model = RandomForestClassifier(n_estimators=10,
    max_features=3, random_state=1)
  model.fit(train_x, train_y)
  print("Done ")

  # 3. evaluate
  acc_train = model.score(train_x, train_y)
  print("\nAccuracy on train = %0.4f " % acc_train)
  acc_test = model.score(test_x, test_y)
  print("Accuracy on test = %0.4f " % acc_test)

  # 3b. display formatted confusion matrix
  from sklearn.metrics import confusion_matrix
  y_predicteds = model.predict(test_x)
  cm = confusion_matrix(test_y, y_predicteds)
  print("\nConfusion matrix: \n")
  show_confusion(cm)

  # 4. use model
  print("\nPredict for: M 35 Nebraska $55K ")
  X = np.array([[0, 0.35, 0,1,0, 0.5500]],
    dtype=np.float32)
  probs = model.predict_proba(X)
  print("\nPrediction pseudo-probs: ")
  print(probs)

  politic = model.predict(X)
  print("\nPredicted class: ")
  print(politic)

  # 5. TODO: save model using pickle

  print("\nEnd scikit random forest demo ")

if __name__ == "__main__":
  main()
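
The TODO at step 5 could be filled in along these lines. The file path and Models directory are my assumptions (the directory must exist before saving):

  import pickle
  print("Saving trained model ")
  fn = ".\\Models\\forest_model.pkl"  # assumed path -- my choice
  with open(fn, "wb") as f:
    pickle.dump(model, f)

  # later, to load and use the saved model:
  with open(fn, "rb") as f:
    model2 = pickle.load(f)
  print(model2.predict(X))  # same X as in step 4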