Calculating Expected Calibration Error for Binary Classification

Suppose you have a binary classification model where the goal is to predict if a person has a disease of some kind, based on predictor variables such as blood pressure, score on a diagnostic test, cholesterol level, and so on. The output of the model is a value between 0 and 1 that indicates the likelihood that the person has the disease. Therefore, model output values can loosely be interpreted as probabilities, where values less than 0.5 indicate class 0 (no disease) and values of 0.5 or greater indicate class 1 (disease).

Output pseudo-probability values are sometimes called confidence values or just probabilities. I’ll use the term pseudo-probabilities.

A machine learning binary classification model is well-calibrated if the output pseudo-probabilities closely reflect the model accuracies. In other words, if the output pseudo-probability for a person is 0.75 then you’d like there to be roughly a 75% chance the model is correct — the person does in fact have the disease.

Some binary classification models are well-calibrated and some are not. The first step in dealing with model calibration is measuring it. There are many ways to measure binary classification model calibration but the most common is to calculate a metric called Calibration Error (CE). Small values of CE indicate a model that is well-calibrated; larger values of CE indicate a model that is less well-calibrated.

Note: For multi-class problems, CE is used with slight changes and is usually called expected calibration error (ECE). However, the terms CE and ECE are often used interchangeably for both binary and multi-class problems.

Calculating CE is best explained by example. Suppose there are just 10 data items to keep the example manageable. Each data item generates an output pseudo-probability (pp) which determines the predicted class. The training data has the known correct class target value which determines if the prediction is correct or wrong.

```
item   pp     pred   target   result
[0]    0.61    1      1       correct
[1]    0.39    0      1       wrong
[2]    0.31    0      0       correct
[3]    0.76    1      1       correct
[4]    0.22    0      1       wrong
[5]    0.59    1      1       correct
[6]    0.92    1      0       wrong
[7]    0.83    1      1       correct
[8]    0.57    1      1       correct
[9]    0.41    0      0       correct
```
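The ten demo items can be set up in plain Python. The 0.5 threshold converts each pseudo-probability to a predicted class; the variable names pp, target, pred, and result are just this sketch's choices:

```python
# the ten demo items: output pseudo-probability and known target class
pp     = [0.61, 0.39, 0.31, 0.76, 0.22, 0.59, 0.92, 0.83, 0.57, 0.41]
target = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]

# predicted class: 1 if pseudo-probability is 0.5 or greater, else 0
pred = [1 if p >= 0.5 else 0 for p in pp]

# result: True where the prediction matches the target class
result = [pr == t for (pr, t) in zip(pred, target)]

print(pred)         # [1, 0, 0, 1, 0, 1, 1, 1, 1, 0]
print(sum(result))  # 7 of the 10 predictions are correct
```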

In principle, the idea is to compare each pseudo-probability with the model accuracy. For example, if the data had 100 items that all generate a pseudo-probability of 0.75, then if the model is perfectly calibrated, you’d expect 75% of those 100 items to be correctly predicted and the remaining 25% of the items to be incorrectly predicted. The difference between the output pseudo-probability and model accuracy is a measure of miscalibration.

Unfortunately, this approach isn’t feasible because you’d need a huge amount of data so that there’d be enough items with each possible pseudo-probability. Therefore, data items have to be binned by output pseudo-probability.

The number of bins to use is arbitrary to some extent. Suppose you decide to use B = 3 equal-interval bins. Bin 1 is for pseudo-probabilities from 0.0 to 0.33, bin 2 is 0.34 to 0.66, bin 3 is 0.67 to 1.0. Each data item is associated with the bin that captures the item’s pseudo-probability. Therefore bin 1 contains items [2] and [4], bin 2 contains items [0], [1], [5], [8] and [9], and bin 3 contains items [3], [6] and [7].

For each bin, you calculate the model accuracy for the items in the bin, and the average pseudo-probability of the items in the bin. For bin 1, item [2] is correctly predicted but item [4] is incorrectly predicted. Therefore the model accuracy for bin 1 is 1/2 = 0.500. Similarly, the accuracy of bin 2 is 4/5 = 0.800. The accuracy of bin 3 is 2/3 = 0.667.

For bin 1, the average of the pseudo-probabilities is (0.31 + 0.22) / 2 = 0.265. Similarly, the average of the pseudo-probabilities in bin 2 is (0.61 + 0.39 + 0.59 + 0.57 + 0.41) / 5 = 0.514. The average pseudo-probability for bin 3 is (0.76 + 0.92 + 0.83) / 3 = 0.837.
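The per-bin accuracies and average pseudo-probabilities above can be checked with a few lines of Python (the bin membership lists are hard-coded from the binning step):

```python
pp = [0.61, 0.39, 0.31, 0.76, 0.22, 0.59, 0.92, 0.83, 0.57, 0.41]
result = [True, False, True, True, False, True, False, True, True, True]

bin_members = [[2, 4], [0, 1, 5, 8, 9], [3, 6, 7]]  # from the binning step

# accuracy and average pseudo-probability for each bin
accs = [sum(result[i] for i in m) / len(m) for m in bin_members]
avgs = [sum(pp[i] for i in m) / len(m) for m in bin_members]

for b in range(3):
    print("bin %d: acc = %.3f  avg pp = %.3f" % (b + 1, accs[b], avgs[b]))
# bin 1: acc = 0.500  avg pp = 0.265
# bin 2: acc = 0.800  avg pp = 0.514
# bin 3: acc = 0.667  avg pp = 0.837
```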

Next, the absolute value of the difference between model accuracy and average pseudo-probability is calculated for each bin. At this point, the calculations are:

```
bin  range           acc    avg pp  |diff|
1    0.00 to 0.33    0.500  0.265   0.235
2    0.34 to 0.66    0.800  0.514   0.286
3    0.67 to 1.00    0.667  0.837   0.170
```

You could calculate a simple average of the bin absolute differences, but because each bin has a different number of data items, a better approach is to calculate a weighted average. Using a weighted average, the final CE value is calculated as:

CE = [(2 * 0.235) + (5 * 0.286) + (3 * 0.170)] / 10
= (0.470 + 1.430 + 0.510) / 10
= 2.410 / 10
= 0.241
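Putting the pieces together, a minimal end-to-end CE function might look like the sketch below. The floor(pp * B) equal-interval binning rule is this sketch's assumption, and empty bins are simply skipped; the function reproduces the 0.241 value computed above:

```python
def calib_error(pp, target, B=3):
    # expected calibration error with B equal-interval bins
    n = len(pp)
    pred = [1 if p >= 0.5 else 0 for p in pp]       # 0.5 decision threshold
    bins = [min(int(p * B), B - 1) for p in pp]     # 0-based bin index

    ce = 0.0
    for b in range(B):
        members = [i for i in range(n) if bins[i] == b]
        if len(members) == 0:
            continue  # skip empty bins
        acc = sum(pred[i] == target[i] for i in members) / len(members)
        avg_pp = sum(pp[i] for i in members) / len(members)
        ce += len(members) * abs(acc - avg_pp)      # weight by bin count
    return ce / n

pp     = [0.61, 0.39, 0.31, 0.76, 0.22, 0.59, 0.92, 0.83, 0.57, 0.41]
target = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
print("CE = %0.3f" % calib_error(pp, target))  # CE = 0.241
```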

Notice that if each bin accuracy equals the bin average pseudo-probability, the expected calibration error is 0.

The CE metric is simple and intuitive. But CE has some weaknesses. The number of bins to use is somewhat arbitrary, and with equal-interval bins, the number of data items in each bin could be significantly skewed. There are many variations of the basic CE metric but these variations add complexity.

In general, logistic regression binary classification models are well-calibrated, but support vector machine models and neural network models often are not.

I’ve been thinking that maybe model calibration error can be used as a measure of dataset similarity. The idea is that similar datasets should have similar calibration error — maybe. It’s an idea that hasn’t been investigated as far as I know.

[Image caption] Binary wristwatches. Left: The time is 6:18. Center-left: A DeTomaso (same Italian company that produces sports cars). Center-right: A watch from a company called The One. Right: I had this Endura jump-hour watch in the 1960s and I was very proud of it.
