“Binary Classification Using a scikit Decision Tree” in Visual Studio Magazine

I wrote an article titled “Binary Classification Using a scikit Decision Tree” in the February 2023 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2023/02/21/scikit-decision-tree.aspx.

A decision tree is a machine learning technique that can be used for binary classification or multi-class classification. My article presents an end-to-end demo that predicts the sex of a person (male = 0 or female = 1) based on their age, state where they live, income and political leaning.

There are several tools and code libraries that you can use to perform binary classification using a decision tree. The scikit-learn library (also called scikit or sklearn) is based on the Python language and is one of the most popular machine learning libraries.

The article demo data is one of my standard synthetic datasets and looks like:

1   0.24   1   0   0   0.2950   0   0   1
0   0.39   0   0   1   0.5120   0   1   0
1   0.63   0   1   0   0.7580   1   0   0
0   0.36   1   0   0   0.4450   0   1   0
1   0.27   0   1   0   0.2860   0   0   1
. . .

The tab-delimited fields are sex (0 = male, 1 = female), age (divided by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), income (divided by $100,000) and political leaning (conservative = 100, moderate = 010, liberal = 001).
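
To make the encoding concrete, here is a minimal sketch that turns one raw record into the nine-value layout described above. The helper names (one_hot, encode_person) and the lowercase category strings are my own assumptions for illustration and are not taken from the article.

```python
# Hypothetical helpers (not from the article): encode one raw record
# into the nine-value layout described above.
STATES = ["michigan", "nebraska", "oklahoma"]        # 100, 010, 001
POLITICS = ["conservative", "moderate", "liberal"]   # 100, 010, 001

def one_hot(value, categories):
    # 1 in the position of the matching category, 0 elsewhere
    return [1 if value == c else 0 for c in categories]

def encode_person(sex, age, state, income, politics):
    row = [1 if sex == "female" else 0]   # male = 0, female = 1
    row.append(age / 100)                 # age divided by 100
    row.extend(one_hot(state, STATES))
    row.append(income / 100_000)          # income divided by $100,000
    row.extend(one_hot(politics, POLITICS))
    return row

# A 24-year-old liberal woman from Michigan earning $29,500
print(encode_person("female", 24, "michigan", 29_500, "liberal"))
# [1, 0.24, 1, 0, 0, 0.295, 0, 0, 1]
```

The printed values match the first line of the demo data.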

One of the advantages of the scikit library is simplicity (at the expense of flexibility). Creating and training a decision tree is easy:

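  from sklearn import tree

  md = 4  # maximum depth of the tree
  print("Creating decision tree max_depth=" + str(md))
  model = tree.DecisionTreeClassifier(max_depth=md)
  model.fit(train_x, train_y)  # train_x = predictors, train_y = 0-1 sex labels
  print("Done ")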
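
The snippet above assumes train_x and train_y already exist. As a rough, self-contained sketch, the five sample rows shown earlier can stand in for the training data. Five rows are far too few for a meaningful model, and the random_state value and the person being predicted are my own assumptions, not details from the article.

```python
import numpy as np
from sklearn import tree

# The five sample rows shown earlier; column 0 is sex, the rest are predictors.
data = np.array([
    [1, 0.24, 1, 0, 0, 0.2950, 0, 0, 1],
    [0, 0.39, 0, 0, 1, 0.5120, 0, 1, 0],
    [1, 0.63, 0, 1, 0, 0.7580, 1, 0, 0],
    [0, 0.36, 1, 0, 0, 0.4450, 0, 1, 0],
    [1, 0.27, 0, 1, 0, 0.2860, 0, 0, 1],
])
train_x = data[:, 1:]              # age, state, income, political leaning
train_y = data[:, 0].astype(int)   # 0 = male, 1 = female

# random_state is an assumption, used only to make the run repeatable
model = tree.DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(train_x, train_y)

# Hypothetical person: age 36, Oklahoma, $50,000 income, moderate
x = np.array([[0.36, 0, 0, 1, 0.5000, 0, 1, 0]])
pred = model.predict(x)[0]
print("predicted sex:", "female" if pred == 1 else "male")
print("training accuracy:", model.score(train_x, train_y))
```

The predict() method expects a two-dimensional array, which is why the single input row is wrapped in an extra pair of brackets.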

An advantage of decision tree classifiers over neural network classifiers is that decision trees are somewhat interpretable because a decision tree is just a set of if-then rules. For example:

|--- income <= 0.34
|   |--- pol2 <= 0.50
|   |   |--- age <= 0.23
|   |   |   |--- income <= 0.28
|   |   |   |   |--- class: 1.0
. . .
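
Rules in this format can be produced with the scikit export_text() function. The sketch below fits a tree to the five sample rows shown earlier and prints its rules. Only the names age, income and pol2 appear in the rules above, so the other column names are my guesses, and the rules printed for five rows will not match the ones in the article.

```python
import numpy as np
from sklearn import tree

# Assumed column names; only "age", "income" and "pol2" appear in the article's rules.
names = ["age", "state0", "state1", "state2", "income", "pol0", "pol1", "pol2"]

# The five sample rows shown earlier; column 0 is sex, the rest are predictors.
data = np.array([
    [1, 0.24, 1, 0, 0, 0.2950, 0, 0, 1],
    [0, 0.39, 0, 0, 1, 0.5120, 0, 1, 0],
    [1, 0.63, 0, 1, 0, 0.7580, 1, 0, 0],
    [0, 0.36, 1, 0, 0, 0.4450, 0, 1, 0],
    [1, 0.27, 0, 1, 0, 0.2860, 0, 0, 1],
])

# random_state is an assumption, used only to make the run repeatable
model = tree.DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(data[:, 1:], data[:, 0])

# export_text() returns the fitted tree as an indented set of if-then rules
rules = tree.export_text(model, feature_names=names)
print(rules)
```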

The two main downsides to decision trees are that they often don’t work well with large datasets, and they are highly susceptible to model overfitting.
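
Overfitting is easy to see on data that has no signal at all. The sketch below is not from the article: it uses random predictors and random 0-1 labels, and the sizes and seed values are arbitrary choices. A tree with no depth limit memorizes the training items, while accuracy on held-out items stays near chance.

```python
import numpy as np
from sklearn import tree

rng = np.random.default_rng(0)
x = rng.random((300, 8))            # random predictors, no real signal
y = rng.integers(0, 2, size=300)    # random 0-1 labels
train_x, test_x = x[:200], x[200:]
train_y, test_y = y[:200], y[200:]

for md in (2, None):                # None lets the tree grow until every leaf is pure
    model = tree.DecisionTreeClassifier(max_depth=md, random_state=0)
    model.fit(train_x, train_y)
    print("max_depth =", md,
          " train acc =", model.score(train_x, train_y),
          " test acc =", model.score(test_x, test_y))
```

Limiting max_depth, as the demo does with a value of 4, is one common way to reduce this effect.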

One of the main areas of social science research is the study of behavioral differences between men and women. It’s well-known that men and women think about relationships differently. Left: How a woman thinks about her relationship with a man. Complicated. Right: How a man thinks about his relationship with a woman. Not so complicated.
