In machine learning, a binary classification problem is one where you are trying to predict something that can be one of two values. For example, suppose you are trying to predict if a baseball team will win (the “+” result) or lose (the “-” result). There are many ways to do binary classification. Probably the most basic technique is called logistic regression classification.

So you create some binary classifier and then use it to make predictions for historical data where you know the actual results. Suppose there are 100 games in your test set. There are four possible outcomes:

You predict the team will win and they do win.

You predict the team will win but they lose.

You predict the team will lose and they do lose.

You predict the team will lose but they win.

In general, the four outcomes are called True Positive (you predict + and are correct), False Positive (you predict + but are incorrect), True Negative (you predict - and are correct), and False Negative (you predict - but are incorrect).

Suppose that for the 100 games, your results are:

True Positive (TP) = 40 (correctly predicted a win)

False Positive (FP) = 20 (incorrectly predicted a win)

True Negative (TN) = 30 (correctly predicted a loss)

False Negative (FN) = 10 (incorrectly predicted a loss)

If you put the data above into a 2×2 table, it’s called a “confusion matrix”.

The most fundamental way to evaluate your binary classification model is to compute your accuracy. Here you were correct a total of 40 + 30 = 70 times out of 100 so the model’s accuracy is 0.70. Pretty good.
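To make the arithmetic concrete, here is a minimal sketch in Python using the four counts from the example. The row and column layout of the 2×2 table (actual results as rows, predicted results as columns) is just one convention I picked for illustration, and the variable names are my own.

```python
# Counts from the 100-game example.
tp, fp, tn, fn = 40, 20, 30, 10

# One possible layout: rows are actual results, columns are predicted results.
confusion = [
    [tp, fn],  # actual win:  predicted win, predicted loss
    [fp, tn],  # actual loss: predicted win, predicted loss
]

total = tp + fp + tn + fn          # 100 games
accuracy = (tp + tn) / total       # correct predictions / all predictions

print(confusion)
print(accuracy)
```

Some tools put predicted values on the rows instead, so always check which layout a confusion matrix uses before reading numbers off it.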

A fancier way to evaluate the model is to compute “precision” and “recall”. Precision and recall are defined:

Precision = TP / (TP+FP) = 40 / (40+20) = 40/60 = 0.67

Recall = TP / (TP+FN) = 40 / (40+10) = 40/50 = 0.80
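The two formulas above translate directly into code. This is a small sketch, not a library function: precision_recall() is a hypothetical helper name, and it assumes the denominators are not zero.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of the predicted wins, the fraction that were real wins
    recall = tp / (tp + fn)     # of the real wins, the fraction that were predicted
    return precision, recall

precision, recall = precision_recall(tp=40, fp=20, fn=10)
print(round(precision, 2), round(recall, 2))  # 0.67 0.8
```

Notice that True Negative does not appear in either formula, which is one way precision and recall differ from accuracy.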

Precision and recall are both similar to accuracy, but both are difficult to understand conceptually. Precision is sort of like accuracy but it looks only at the data you predicted positive (in this example you’re only looking at games where you predicted a win). Recall is also sort of like accuracy but it looks only at the data that is actually positive (in this example, games the team actually won), which is sometimes called the “relevant” data.

I go crazy trying to understand the deep meaning of precision and recall, and much prefer to just think of them as two numbers that measure the quality of a binary classification model.

Now any ML binary classifier has one or more parameters that you can adjust, which will create a different resulting model. In the case of a logistic regression classifier, you can adjust something called the threshold, which is an internal number between 0 and 1 that determines whether a prediction is positive or not. As you increase the threshold value above 0.5, it becomes more difficult for a data item to be classified as positive.
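Here is a rough sketch of how a threshold turns a predicted probability into a + or - prediction. The classify() function and the five probabilities are made up for illustration, and treating a probability exactly equal to the threshold as positive is an assumption (some implementations use a strict greater-than).

```python
def classify(prob_win, threshold=0.5):
    """Return '+' if the predicted win probability reaches the threshold, else '-'."""
    return "+" if prob_win >= threshold else "-"

# Hypothetical win probabilities from a classifier for five games.
probs = [0.55, 0.62, 0.80, 0.48, 0.91]

default_preds = [classify(p) for p in probs]                 # threshold = 0.5
strict_preds = [classify(p, threshold=0.75) for p in probs]  # threshold = 0.75

print(default_preds)  # ['+', '+', '+', '-', '+']
print(strict_preds)   # ['-', '-', '+', '-', '+']
```

With the higher threshold, two of the four positive predictions turn into negative predictions.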

So in the example of predicting whether the baseball team will win (so you can bet on them), if you use a high threshold, like 0.75, then you won’t get as many “win” predictions as you would with a lower threshold value, but with the higher threshold you’ll be more likely to win your bet when the classifier predicts a win. In other words, there’s a tradeoff between getting lots of betting opportunities with a moderate probability of winning, and getting fewer betting opportunities but with a higher probability of winning.

If you change a binary classifier parameter (the threshold for a logistic regression classifier), the precision and recall will change. Typically, if the precision increases (your chance of winning your bet when the classifier predicts a win), the recall (the fraction of actual wins the classifier catches, which drives your number of betting opportunities) will decrease. And vice versa.

For logistic regression classification, every value of the threshold will give you a precision value and a recall value. If you graph these points (with precision on the y-axis and recall on the x-axis), you get a precision-recall curve (or equivalently, a precision-recall graph). It would look something like the graph at the top of this post.

Each point on the precision-recall curve corresponds to a value of the threshold of the model. Unfortunately, precision-recall graphs usually don’t label each point with the corresponding value of the model parameter, even though they should.
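The sketch below shows one way to compute the points of a precision-recall curve while keeping the threshold attached to each point. The ten games are made-up data, pr_point() is a hypothetical helper, and reporting a precision of 1.0 when nothing is predicted positive is just one common convention.

```python
# Made-up data: (predicted win probability, actual result), 1 = win, 0 = loss.
games = [(0.95, 1), (0.85, 1), (0.80, 1), (0.70, 0), (0.60, 1),
         (0.55, 0), (0.45, 1), (0.40, 0), (0.30, 0), (0.20, 0)]

def pr_point(games, threshold):
    """Return (precision, recall) when predicting '+' for prob >= threshold."""
    tp = sum(1 for p, y in games if p >= threshold and y == 1)
    fp = sum(1 for p, y in games if p >= threshold and y == 0)
    fn = sum(1 for p, y in games if p < threshold and y == 1)
    # Convention (an assumption): precision is 1.0 if nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Each curve entry is (threshold, precision, recall).
curve = [(t, *pr_point(games, t)) for t in (0.25, 0.50, 0.75)]
for t, precision, recall in curve:
    print(f"threshold={t:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

For this data, raising the threshold from 0.25 to 0.50 to 0.75 moves precision from about 0.56 to 0.67 to 1.00 while recall drops from 1.00 to 0.80 to 0.60. Plotting recall on the x-axis and precision on the y-axis, and labeling each point with its threshold, gives the kind of annotated graph described above.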

Every problem will have different priorities, and you have to adjust the threshold (or whatever parameters you’re using in your binary classifier) to get higher precision or higher recall, at the expense of the other factor.

(Note: Thanks to Richard Hughes who pointed out a math error in an earlier blog post on this topic.)