Accuracy, Precision, Recall, and F1 Score

If you have a binary classification problem, four fundamental metrics are accuracy, precision, recall, and F1 score. They’re best explained by example. Suppose the problem is to predict if a sports team will win or lose. There are four possible scenarios:

1. you predict the team will win and they do ("true positive")
2. you predict the team will win but they don't ("false positive")
3. you predict the team will lose and they do ("true negative")
4. you predict the team will lose but they don't ("false negative")
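
The four scenarios above can be sketched in code. This is only a minimal illustration: it assumes the predictions and the actual results are stored as parallel lists of the strings "win" and "lose", and the function name count_outcomes and the five-game toy data are hypothetical, made up just for this demo.

```python
def count_outcomes(predicted, actual, positive="win"):
    """Tally TP, FP, TN, FN for paired predicted/actual labels."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for p, a in zip(predicted, actual):
        if p == positive:
            # Predicted a win: correct if the team actually won.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted a loss: wrong if the team actually won.
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Hypothetical toy data: five games, not the 100 games used in the article.
predicted = ["win", "win", "lose", "lose", "win"]
actual = ["win", "lose", "lose", "win", "win"]
print(count_outcomes(predicted, actual))
# prints {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
```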

Suppose you make 100 predictions for different games and your results are:

TP = 40 (correctly predicted a win)
FP = 20 (incorrectly predicted a win)
TN = 30 (correctly predicted a loss)
FN = 10 (incorrectly predicted a loss)

The four metrics are:

1. accuracy = num correct / (num correct + num wrong)
            = (TP + TN) / (TP + FP + TN + FN)
            = 70 / 100
            = 0.70

2. precision = TP / (TP + FP)
             = 40 / (40 + 20)
             = 40 / 60
             = 0.67

3. recall = TP / (TP + FN)
          = 40 / (40 + 10)
          = 40 / 50
          = 0.80

4. F1 score = 1 / [ ((1 / precision) + (1 / recall)) / 2 ]
            = 1 / [ ((1 / 0.67) + (1 / 0.80)) / 2 ]
            = 1 / [ (1.50 + 1.25) / 2 ]
            = 1 / (2.75 / 2)
            = 1 / 1.375
            = 0.73
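
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not a library routine: the function name classification_metrics is my own, and it assumes none of the denominators is zero (there is no guard against a model that never predicts a win, for example).

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, F1) from the four counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic average of precision and recall.
    f1 = 1 / (((1 / precision) + (1 / recall)) / 2)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=40, fp=20, tn=30, fn=10)
print(f"accuracy  = {acc:.2f}")   # 0.70
print(f"precision = {prec:.2f}")  # 0.67
print(f"recall    = {rec:.2f}")   # 0.80
print(f"F1 score  = {f1:.2f}")    # 0.73
```

The code works with the unrounded precision (40 / 60), which is why 1 / precision comes out to exactly 1.50.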

Accuracy is intuitive and, in my opinion, the single most important metric. Precision and recall are very difficult for me to interpret intuitively, so I just think of them as metrics where higher values are better. In practice there is usually a trade-off: tuning a model to increase precision tends to decrease recall, and vice versa, although this isn't a strict rule. The F1 score is the harmonic average of precision and recall, the idea being that it gives you a single combined metric. Because higher precision and higher recall are both better, larger F1 values are better too. Notice that the F1 score of 0.73 is between the precision (0.67) and recall (0.80). You could use a regular average instead of a harmonic average, but because precision and recall are both ratios with the same numerator (TP), a harmonic average is more principled.
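
To get a feel for how a harmonic average differs from a regular average, here is a small sketch using Python's standard statistics module. The helper name compare_averages is my own, and the lopsided pair (precision 1.00, recall 0.10) is a hypothetical case chosen only to exaggerate the difference.

```python
from statistics import harmonic_mean, mean

def compare_averages(precision, recall):
    """Return (regular average, harmonic average) of precision and recall."""
    values = [precision, recall]
    return mean(values), harmonic_mean(values)

pairs = {
    "article example": (40 / 60, 40 / 50),
    "lopsided (hypothetical)": (1.00, 0.10),
}
for label, (precision, recall) in pairs.items():
    regular, harmonic = compare_averages(precision, recall)
    print(f"{label}: regular = {regular:.3f}, harmonic = {harmonic:.3f}")
```

For the article's values the two averages are close (about 0.733 versus 0.727). For the lopsided pair the regular average is 0.55 but the harmonic average is only about 0.18, because the harmonic average is pulled toward the smaller of the two values.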



The movie “Total Recall” (1990), starring Arnold Schwarzenegger and Sharon Stone, had fantastic special effects for its time. But the plot had me very confused: I never really knew exactly who was good and who was bad, even at the end of the movie. I don’t like ambiguous movie endings. The remake in 2012 was just plain bad, bad, bad.
