I was working on an anomaly detection system recently. When working with any machine learning prediction system, you should evaluate the effectiveness of the system. The basic effectiveness metric is prediction accuracy. But in systems where there is imbalanced data, accuracy can be badly misleading. For example, in an anomaly detection scenario, suppose 99% of your data items are normal (class 0, negative) and 1% are anomalous (class 1, positive). If you just predict class 0 for every input item you’ll get 99% accuracy, even though you haven’t detected a single anomaly.
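Here is a minimal sketch of that accuracy trap. The data is made up for illustration (990 normal items, 10 anomalies), and the "model" is just a list of all-zero predictions.

```python
# Hypothetical data: 990 normal items (class 0) and 10 anomalies (class 1).
actual = [0] * 990 + [1] * 10

# A useless "model" that predicts class 0 for every input item.
predicted = [0] * len(actual)

num_correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = num_correct / len(actual)

# Count how many of the real anomalies were actually flagged.
anomalies_found = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(f"accuracy = {accuracy:.2f}")            # accuracy = 0.99
print(f"anomalies found = {anomalies_found}")  # anomalies found = 0
```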
So, with imbalanced data you should look at precision and recall. Both of these metrics ignore true negatives, but in a slightly different way.
In addition to imbalanced data, a second characteristic of ML prediction systems is the threshold value. Most prediction systems emit a pseudo-probability (pp) value between 0 and 1. The default approach is to set a threshold value of 0.5, and then a computed pp less than 0.5 indicates class 0 = normal = negative, and a computed pp value greater than 0.5 indicates class 1 = anomalous = positive. But you can set the threshold value to whatever you want. Adjusting the threshold value will change accuracy, precision, and recall.
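As a rough sketch of how the threshold turns pp values into class labels: the function name classify and the sample pp values are made up for illustration, and treating a pp exactly equal to the threshold as positive is an assumption, not something fixed by any standard.

```python
def classify(pp, threshold=0.5):
    """Return 1 (anomalous) if pp >= threshold, else 0 (normal).

    Counting pp == threshold as positive is an arbitrary choice here.
    """
    return 1 if pp >= threshold else 0

# Hypothetical pseudo-probabilities emitted by some model.
pps = [0.10, 0.45, 0.55, 0.80, 0.95]

for t in (0.3, 0.5, 0.9):
    labels = [classify(pp, t) for pp in pps]
    print(f"threshold {t}: {labels}")

# threshold 0.3: [0, 1, 1, 1, 1]
# threshold 0.5: [0, 0, 1, 1, 1]
# threshold 0.9: [0, 0, 0, 0, 1]
```

A lower threshold flags more items as anomalous, and a higher threshold flags fewer.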
OK, I’m slowly getting to the point.
An anomaly detection system is a binary prediction system — an input item is predicted as normal or anomalous. In such situations there are exactly four possible outcomes when you make a prediction.
TP: true positive (correctly predict an anomaly)
TN: true negative (correctly predict a normal)
FP: false positive (incorrectly predict a normal as anomaly)
FN: false negative (incorrectly predict an anomaly as normal)
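Counting the four outcomes is straightforward. This sketch uses the standard definitions: a false positive is a normal item flagged as an anomaly, and a false negative is an anomaly that was predicted as normal. The function name confusion_counts and the ten sample items are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) for 0/1 labels, where 1 = anomalous = positive."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1   # anomaly, predicted anomaly
        elif a == 0 and p == 0:
            tn += 1   # normal, predicted normal
        elif a == 0 and p == 1:
            fp += 1   # normal, wrongly predicted anomaly
        else:
            fn += 1   # anomaly, wrongly predicted normal
    return tp, tn, fp, fn

# Hypothetical labels for ten items.
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)  # 2 6 1 1
```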
For a specific threshold value, you can calculate four basic metrics for an ML prediction system:
accuracy = (TP + TN) / (TP + TN + FP + FN) - uses all
precision = TP / (TP + FP) - ignores TNs
recall = TP / (TP + FN) - ignores TNs
F1 = harmonic_avg(pre, rec) - ignores TNs
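The four metrics can be sketched directly from the counts. The function name basic_metrics and the example counts are made up for illustration, and returning 0.0 when a denominator is zero is just one common convention, not the only reasonable one.

```python
def basic_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from the four outcome counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumption: report 0.0 when a metric is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic average
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts: 1,000 items, 10 of them true anomalies.
acc, pre, rec, f1 = basic_metrics(tp=8, tn=980, fp=10, fn=2)
print(f"accuracy  = {acc:.4f}")  # accuracy  = 0.9880
print(f"precision = {pre:.4f}")  # precision = 0.4444
print(f"recall    = {rec:.4f}")  # recall    = 0.8000
print(f"F1        = {f1:.4f}")   # F1        = 0.5714
```

Notice that accuracy looks excellent while precision shows that more than half of the flagged items are false alarms.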
The F1 score is the harmonic average of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). You need to look at both precision and recall because they trade off against each other: if you adjust the threshold so that precision goes up (good), then recall will usually go down (bad), and vice versa.
Now if you evaluate a prediction system using different threshold values, you can graph the results. Two graphs are common, the ROC (hideously named receiver operating characteristic) curve and the PR (precision-recall) curve:
ROC curve = FPR (x-axis) vs. TPR (y-axis), where FPR = FP / (FP + TN) and TPR = TP / (TP + FN) = recall - uses all four counts
PR curve = recall (x-axis) vs. precision (y-axis) - ignores TN
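Here is a sketch of how the points on each curve could be generated, one point per threshold. ROC curves are conventionally plotted with rates rather than raw counts: false positive rate FPR = FP / (FP + TN) on the x-axis and true positive rate TPR = TP / (TP + FN), which is the same as recall, on the y-axis. The function name curve_points and the data are made up for illustration. The code assumes both classes are present, treats pp >= threshold as positive, and uses a precision of 1.0 when nothing is predicted positive (a convention, not a rule).

```python
def curve_points(actual, pps, thresholds):
    """Return (roc, pr): lists of (FPR, TPR) and (recall, precision) points."""
    roc, pr = [], []
    for t in thresholds:
        tp = sum(1 for a, s in zip(actual, pps) if a == 1 and s >= t)
        fp = sum(1 for a, s in zip(actual, pps) if a == 0 and s >= t)
        fn = sum(1 for a, s in zip(actual, pps) if a == 1 and s < t)
        tn = sum(1 for a, s in zip(actual, pps) if a == 0 and s < t)
        tpr = tp / (tp + fn)   # true positive rate = recall
        fpr = fp / (fp + tn)   # false positive rate
        # Assumption: precision is taken as 1.0 when no item is flagged.
        precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
        roc.append((fpr, tpr))
        pr.append((tpr, precision))
    return roc, pr

# Hypothetical data: 5 normal items, 3 anomalies.
actual = [0, 0, 0, 0, 1, 0, 1, 1]
pps    = [0.05, 0.20, 0.35, 0.45, 0.50, 0.60, 0.75, 0.90]

roc, pr = curve_points(actual, pps, thresholds=[0.3, 0.55, 0.7, 0.8])
for (fpr, tpr), (rec, pre) in zip(roc, pr):
    print(f"FPR={fpr:.2f} TPR={tpr:.2f} | recall={rec:.2f} precision={pre:.2f}")

# FPR=0.60 TPR=1.00 | recall=1.00 precision=0.50
# FPR=0.20 TPR=0.67 | recall=0.67 precision=0.67
# FPR=0.00 TPR=0.67 | recall=0.67 precision=1.00
# FPR=0.00 TPR=0.33 | recall=0.33 precision=1.00
```

As the threshold rises, precision climbs while recall falls, which is the trade-off both curves are meant to show.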
Each curve is made of many points. One way to summarize a curve/graph as a single number is to compute the area under the curve (AUC). So there is ROC AUC and PR AUC.
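One simple way to approximate the area is the trapezoidal rule, sketched below. The function name area_under_curve and the sample points are made up for illustration. The code assumes the points are already ordered along the curve with non-decreasing x values. Trapezoids are only one choice of interpolation; for PR curves in particular, step-wise summation is a common alternative and can give a somewhat different number.

```python
def area_under_curve(points):
    """Approximate the area under a curve given as ordered (x, y) points.

    Assumes points are sorted by non-decreasing x. Uses the trapezoidal rule.
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical ROC points, as (FPR, TPR), from (0, 0) to (1, 1).
roc_points = [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area_under_curve(roc_points))                # 0.875

# Reference cases: random guessing (the diagonal) and a perfect classifier.
print(area_under_curve([(0.0, 0.0), (1.0, 1.0)]))  # 0.5
print(area_under_curve([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))  # 1.0
```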
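Putting it all together, ROC AUC is a metric that summarizes model effectiveness (over different threshold values). It does use the negative / normal / class 0 items, but only through the false positive rate, which is normalized by the number of negatives, so ROC AUC isn’t driven by the ratio of normal to anomalous items the way accuracy is. PR AUC is a metric that summarizes model effectiveness, which ignores true negatives. Both are commonly used when there are many class 0 / normal items and very few class 1 / anomalous items. Because PR AUC ignores true negatives, it is usually the more sensitive of the two to how well the rare anomalous class is handled.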