Haberman’s Survival Data

A few days ago, I gave a talk about logistic regression for binary classification. A binary classification problem is one where the goal is to make a prediction where the thing to predict can be just one of two possible values.

There are many machine learning techniques that can be used for binary classification, but arguably the most fundamental is logistic regression.

A classic binary classification benchmark dataset is called (Shelby J.) Haberman’s Survival Dataset. There are 306 data items. The raw data looks like:

. . .

The first field is the age of a patient. The second field is the xx part of the 19xx year when an operation was performed. The third field is the number of “nodes” (medical thing) detected in the patient. The last field is the thing to predict where 1 = the patient survived for five years or more, and 2 = the patient died within five years.

Importantly, the number of survivors in the dataset is 225 and the number of people who died is 81. So, if you just guessed “survived” (1) for every patient, you’d be correct with 225 / 306 = 0.7353 = 73.53% accuracy.

Anyway, I didn’t give the audience any hints and asked them to use CNTK to try and create a logistic regression model to predict the survive/die outcome of the 306 data items. I told them (honestly) that my best result was 0.7500 accuracy.

Anyway, the point of the exercise was to demonstrate that the problem is essentially unsolvable using logistic regression. Logistic regression can only do what’s called linearly separable classification. If you graph (age, nodes) vs. survival, or (year, nodes) vs. survival, you can see that there’s no way to draw a straight line that separates the two classes (blue = survive, red = die, in my graph above).

In fact, even a deep neural network can’t really do very well on Haberman’s data. The best results I’ve ever gotten on Haberman’s survival data use a primitive form of classification called k-NN (“k nearest neighbors). Unfortunately, some prediction problems just aren’t tractable.

Artist Thomas Haberman

This entry was posted in CNTK, Machine Learning. Bookmark the permalink.