Why Do I Never Remember the Differences Between ROC AUC and PR AUC?

I was working on an anomaly detection system recently. When working with any machine learning prediction system, you should evaluate the effectiveness of the system. The most basic effectiveness metric is prediction accuracy. But in systems with imbalanced data, accuracy isn’t very useful. For example, in an anomaly detection scenario, suppose 99% of your data items are normal (class 0, negative) and 1% are anomalous (class 1, positive). If you just predict class 0 for every input item, you’ll get 99% accuracy even though you never detect a single anomaly.
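A tiny Python sketch makes the trap concrete (the counts here are made up for illustration):

# accuracy is misleading on imbalanced data (made-up counts)
num_normal = 990    # class 0 items
num_anomaly = 10    # class 1 items

# a "model" that predicts class 0 for every input item
num_correct = num_normal        # every normal item is scored as correct
num_wrong = num_anomaly         # every anomaly is missed
accuracy = num_correct / (num_normal + num_anomaly)
print(accuracy)    # 0.99 -- looks great, but no anomaly is ever detected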

So, with imbalanced data you should look at precision and recall. Both metrics ignore true negatives, but each penalizes a different kind of error: precision is hurt by false positives, and recall is hurt by false negatives.

In addition to imbalanced data, a second characteristic of ML prediction systems is the threshold value. Most prediction systems emit a pseudo-probability (pp) value between 0 and 1. The default approach is to set a threshold value of 0.5, and then a computed pp less than 0.5 indicates class 0 = normal = negative, and a computed pp value greater than 0.5 indicates class 1 = anomalous = positive. But you can set the threshold value to whatever you want. Adjusting the threshold value will change accuracy, precision, and recall.
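Here is a minimal Python sketch of thresholding (the pp values are made up):

# convert pseudo-probabilities to class labels using a threshold
pps = [0.10, 0.48, 0.52, 0.97]    # made-up pseudo-probabilities from a model
threshold = 0.5                   # default; raise or lower it to trade precision against recall
preds = [1 if pp > threshold else 0 for pp in pps]
print(preds)    # [0, 0, 1, 1] where 1 = anomalous/positive, 0 = normal/negative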

OK, I’m slowly getting to the point.

An anomaly detection system is a binary prediction system — an input item is predicted as normal or anomalous. In such situations there are exactly four possible outcomes when you make a prediction.

TP: true positive (correctly predict an anomalous item as an anomaly)
TN: true negative (correctly predict a normal item as normal)
FP: false positive (incorrectly predict a normal item as an anomaly)
FN: false negative (incorrectly predict an anomalous item as normal)

For a specific threshold value, you can calculate four basic metrics for an ML prediction system:

accuracy  = (TP + TN) / (TP + TN + FP + FN) - uses all
precision = TP / (TP + FP) - ignores TNs
recall    = TP / (TP + FN) - ignores TNs
F1        = harmonic_avg(pre, rec) - ignores TNs

The F1 score is the harmonic average of precision and recall. You need to look at both precision and recall because when you adjust the threshold, if precision goes up (good) then recall typically goes down (bad), and vice versa.
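Here is a minimal Python sketch that computes the four counts and the four metrics from scratch (the labels and predictions are made up):

# confusion counts and basic metrics at a fixed threshold (made-up data)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]    # 1 = anomalous, 0 = normal
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]    # predictions at some threshold

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)    # harmonic average

print(tp, tn, fp, fn)                     # 2 6 1 1
print(accuracy, precision, recall, f1)    # 0.80 0.67 0.67 0.67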


[Figure: ROC curves for two models, A and B. The area under the ROC curve for model B is greater than the area under the curve for model A, so model B is better overall. Each ROC curve here is really just three points that correspond to three threshold values.]

Now if you evaluate a prediction system using different threshold values, you can graph the results. There are two common graphs, the ROC (hideously named receiver operating characteristic) curve and the PR (precision-recall) curve:

ROC curve = false positive rate, FP / (FP + TN) (x-axis) vs. true positive rate, TP / (TP + FN) (y-axis)
PR curve  = recall, TP / (TP + FN) (x-axis) vs. precision, TP / (TP + FP) (y-axis) - ignores TNs

Each curve is made of many points. One way to summarize a curve as a single number is to compute the area under the curve (AUC). So there are two summary metrics: ROC AUC and PR AUC.
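One way to get both values without computing the curves by hand is scikit-learn’s roc_auc_score and average_precision_score functions (average precision is a common way to summarize the PR curve). A sketch, with made-up labels and scores:

# ROC AUC and PR AUC via scikit-learn (made-up labels and scores)
from sklearn.metrics import roc_auc_score, average_precision_score

y_true  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]                        # 1 = anomalous
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.1, 0.4, 0.8, 0.9]    # pseudo-probabilities

roc_auc = roc_auc_score(y_true, y_score)
pr_auc  = average_precision_score(y_true, y_score)    # average precision summarizes the PR curve
print(roc_auc, pr_auc)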

Putting it all together, ROC AUC is a metric that summarizes model effectiveness (over different threshold values) using the true positive rate and the false positive rate. PR AUC is a metric that summarizes model effectiveness using precision and recall, both of which ignore true negatives. Both are more informative than plain accuracy when there are many class 0 / normal items and very few class 1 / anomalous items, and because PR AUC ignores the huge number of true negatives, it tends to be the more sensitive of the two to how well the rare anomalous class is handled.



[Two images: scenarios where you don’t need a machine learning model to predict what will happen.]
