Datasets for Binary Classification

The goal of a binary classification problem is to create a machine learning model that makes a prediction in situations where the thing to predict can take one of just two possible values. For example, you might want to predict whether a person is male (0) or female (1) based on predictor variables such as age, income, height, political party affiliation, and so on.

There are many different techniques you can use for a binary classification problem. These techniques include logistic regression, k-NN (if all predictors are numeric), naive Bayes (if all predictors are non-numeric), support vector machines (rarely used anymore), decision trees and random forests, and many others. My favorite technique is to use a standard neural network.

If you want to explore binary classification techniques, you need a dataset. You can make your own fake data, but using a standard benchmark dataset is often a better idea because you can compare your results with others.

Here’s a brief description of four of the benchmark datasets I often use for exploring binary classification techniques. These datasets are relatively small and their predictor variables are all, or almost all, numeric, so little or no data encoding is needed.

1. The Cleveland Heart Disease Dataset

There are 303 items (patients), six of which have a missing value. There are 13 predictor variables (age, sex, cholesterol, etc.). The variable to predict is encoded as 0 to 4, where 0 means no heart disease and 1-4 means presence of heart disease. See https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data. Sample:

63.0,1.0,1.0,145.0, . . 6.0,0
67.0,1.0,4.0,160.0, . . 3.0,2
67.0,1.0,4.0,120.0, . . 7.0,1
. . .
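
To use this as a binary problem, the 0 to 4 label has to be collapsed to 0/1 and the items with a missing value have to be dealt with. Here is a minimal sketch that simply skips incomplete items. It assumes a missing value is marked with "?" in the file, and the function name parse_cleveland is hypothetical:

```python
def parse_cleveland(lines):
    """Return (features, labels) from comma-delimited Cleveland rows.

    Assumes the last field is the 0-4 diagnosis and that a missing
    value is marked with "?" (an assumption about the file format).
    """
    features, labels = [], []
    for line in lines:
        fields = [f.strip() for f in line.strip().split(",")]
        if len(fields) < 2 or "?" in fields:
            continue  # skip blank lines and items with a missing value
        values = [float(f) for f in fields]
        features.append(values[:-1])
        # 0 stays 0 (no disease); 1-4 collapse to 1 (disease present)
        labels.append(0 if values[-1] == 0 else 1)
    return features, labels
```

Because the function only iterates over lines, you can pass it an open file object or a list of strings. Skipping incomplete items is the simplest choice. Filling in the missing values is an alternative if you can't afford to lose any data.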

2. The Banknote Authentication Dataset

There are 1372 items (images of banknotes — think Euro or dollar bill). There are 4 predictor variables (variance of image, skewness, kurtosis, entropy). The variable to predict is encoded as 0 (authentic) or 1 (forgery). See https://archive.ics.uci.edu/ml/datasets/banknote+authentication. Sample:

3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
. . .
-1.3971,3.3191,-1.3927,-1.9948,1
0.39012,-0.14279,-0.031994,0.35084,1
. . .
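
The sample suggests the file lists the authentic items first and the forgeries after them, so if you split the data into training and test sets you probably want to shuffle first. A minimal sketch, where the function name, the 80/20 split, and the fixed seed are all just illustrative choices:

```python
import random

def shuffle_split(rows, train_frac=0.8, seed=0):
    """Shuffle rows reproducibly, then split into (train, test)."""
    rows = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)  # private generator, fixed seed
    n_train = int(len(rows) * train_frac)
    return rows[:n_train], rows[n_train:]
```

Using a fixed seed means the same split comes back on every run, which makes results easier to compare.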

3. The Wisconsin Cancer Dataset

There are 699 items (patients). There is an ID followed by nine predictor variables (thickness, cell size uniformity, etc.). The variable to predict is encoded as 2 (benign) or 4 (malignant). See https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/. Sample:

1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
. . .
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
. . .
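
The ID column isn't a predictor, and most libraries expect 0/1 labels rather than 2/4. Here is a rough sketch of that cleanup. The function name parse_wisconsin is hypothetical, and the code assumes that a missing value, if any, is marked with "?":

```python
def parse_wisconsin(lines):
    """Return (features, labels), dropping the leading ID column.

    Assumes the last field is the class (2 = benign, 4 = malignant)
    and that a missing value is marked with "?" (an assumption).
    """
    label_map = {"2": 0, "4": 1}  # benign -> 0, malignant -> 1
    features, labels = [], []
    for line in lines:
        fields = [f.strip() for f in line.strip().split(",")]
        if len(fields) < 3 or "?" in fields:
            continue  # skip blank lines and items with a missing value
        features.append([int(f) for f in fields[1:-1]])  # drop the ID
        labels.append(label_map[fields[-1]])
    return features, labels
```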

4. Haberman’s Survival Dataset

There are 306 items (patients who had breast cancer surgery). There are three predictor variables (age, year of operation, number of positive axillary nodes). The variable to predict is encoded as 1 (survived five years or longer) or 2 (died within five years). See https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival. Sample:

30,64,1,1
30,62,3,1
. . .
77,65,3,1
78,65,1,2
83,58,2,2
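
With a small dataset like this, it's worth checking the class balance before training, because a model should at least beat the accuracy you'd get by always guessing the more common class. Here is a sketch with hypothetical function names. The remapping of 1/2 to 0/1 is just a convenience, not part of the dataset:

```python
from collections import Counter

def haberman_labels(lines):
    """Map the last field of each row from 1/2 to 0 (survived) / 1 (died)."""
    return [int(line.strip().split(",")[-1]) - 1
            for line in lines if line.strip()]

def majority_baseline(labels):
    """Return (class counts, accuracy of always predicting the commonest class)."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return dict(counts), most_common_count / len(labels)
```

Applied to just the five sample lines above, this gives three items of class 0 and two of class 1, so the baseline accuracy on that tiny sample is 0.6.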

Here are some well-known datasets that I don’t like to use:

The Adult dataset to predict if a person makes more than $50,000 per year or not (see https://archive.ics.uci.edu/ml/datasets/Adult ) is popular but it has 48,842 items and eight of the 14 predictor variables are categorical.

The Titanic dataset (did a passenger survive or not – see https://www.kaggle.com/c/titanic ) is popular but requires you to sign up with Kaggle and get annoying messages, and the dataset has been pre-split into training and test sets which isn’t always wanted.

The Pima Indians Diabetes dataset (woman has diabetes or not – see https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes ) is popular, but the dataset makes no sense to me because some of the predictor variables have a value of 0 in situations where that is biologically impossible.
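
If you want to see how widespread the zero problem is in the Pima data, you can count the zeros in the columns where 0 isn't a plausible measurement. The sketch below assumes the usual column order, with glucose, blood pressure, skin thickness, insulin, and BMI at indices 1 through 5. The function name count_zeros is hypothetical:

```python
def count_zeros(rows, columns=(1, 2, 3, 4, 5)):
    """Count zero entries per column index.

    The default indices assume glucose, blood pressure, skin thickness,
    insulin, and BMI sit at positions 1-5 (an assumption; adjust them if
    your copy of the file is laid out differently).
    """
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            if float(row[c]) == 0.0:
                counts[c] += 1
    return counts
```

Each row can be a list of numbers or of numeric strings, because every value goes through float() before the comparison.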



Binary star system GG Tauri-A
