Loss-Proportional Subsampling to Create a Training Data Coreset

Suppose you have a file of machine learning training data and the file is enormous — millions or even billions of data items. You might want to select a sample of the huge dataset (sometimes called a coreset) and use it for training instead of trying to work with the whole dataset.

Now suppose that your huge source dataset has some data items that are relatively rare. For example, in a dataset of technical PhD people, you might have many white males but almost no Black males. If you take a naive approach and just select a random sample of the source dataset, the sample will likely miss most of the minority class items, and your resulting machine learning prediction model will have poor accuracy for minority class items.

Two pages from the research paper. Like almost all research papers, the intended audience is experts in a very narrow field, rather than software engineers and data scientists who want to implement new ideas. This is neither good nor bad — it just is.

I recently learned about a clever technique to intelligently sample from a large set of training data in a way that lessens the minority class exclusion problem that occurs with random sampling. The technique was published in a 2013 research paper titled “Loss-Proportional Subsampling for Subsequent ERM” by P. Mineiro and N. Karampatziakis.

In the explanation that follows, I’ll assume that your huge source training dataset is intended for binary classification — predicting if a person is male or female. However, the loss-proportional subsampling technique also works for multi-class classification (e.g., predicting the city a person lives in) and regression problems (e.g., predicting the annual income of a person).

Suppose your huge dataset has many males (class 0) but very few females (class 1). You start by training a simple, crude binary classification model on the entire huge source dataset. This is feasible because simple classification techniques, such as logistic regression binary classification, are very fast and can be applied to huge datasets.
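As a sketch of this first step (pure Python with a tiny, made-up one-feature dataset rather than a real library and a huge file), a crude logistic regression binary classifier can be fit with plain stochastic gradient descent:

```python
import math

def train_crude_logreg(xs, ys, epochs=200, lr=0.1):
    # xs: list of feature vectors, ys: list of 0/1 class labels
    n_feat = len(xs[0])
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted pseudo-probability of class 1
            g = p - y                       # gradient of log-loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# tiny stand-in dataset: the single feature is negative for
# class 0 items and positive for class 1 items
xs = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_crude_logreg(xs, ys)
print(predict(w, b, [2.0]))   # close to 1
print(predict(w, b, [-2.0]))  # close to 0
```

In practice you'd use a fast library implementation for this step; the point is only that the crude model is cheap enough to run over the entire source dataset.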

Next, you run each of the source data items through the simple classifier. This will generate a loss value for each data item. For example, the loss information might look like:

source data
item  prediction  actual   loss   prob of selection
[0]    0.80         1      0.04   1.5 * 0.04 = 0.06
[1]    0.50         0      0.25   1.5 * 0.25 = 0.38
[2]    0.90         1      0.01   1.5 * 0.01 = 0.02
[3]    0.70         1      0.09   1.5 * 0.09 = 0.14
. . .
[N]    0.85         1      0.02   1.5 * 0.02 = 0.03

In general, the loss (here measured as the square of the difference between the predicted pseudo-probability and the actual class label) for majority class 0 (male) items will be lower than the loss for minority class 1 (female) items.
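Computing these per-item losses is straightforward; here is a sketch in Python using just the four items from the table above:

```python
# predicted pseudo-probabilities from the crude classifier,
# and the actual 0/1 class labels
preds  = [0.80, 0.50, 0.90, 0.70]
actual = [1,    0,    1,    1]

# loss = square of (predicted pseudo-probability - actual label)
losses = [round((p - y) ** 2, 2) for p, y in zip(preds, actual)]
print(losses)  # [0.04, 0.25, 0.01, 0.09]
```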

Next, you convert each item's loss into a selection probability, for example by multiplying by some constant lambda, such as the 1.5 used above. There are many ways to turn the per-item losses into probabilities. For example, you could sum all the loss values and then divide each loss by the sum:

source data
item  prediction  actual   loss   prob of selection
[0]    0.80         1      0.04   0.04 / 19.30 = 0.002
[1]    0.50         0      0.25   0.25 / 19.30 = 0.013
[2]    0.90         1      0.01   0.01 / 19.30 = 0.001
[3]    0.70         1      0.09   0.09 / 19.30 = 0.005
. . .
[N]    0.85         1      0.02   0.02 / 19.30 = 0.001
                          ------                -------
                     sum  19.30                    1

Now, suppose you want a sample that is 10% of the size of the large source dataset. You iterate through the source dataset and select each item, adding it to the sample, with the item's associated probability. In the example above, item [0] would be selected with prob = 0.002 (likely not selected), then item [1] would be selected with prob = 0.013 (more likely), and so on. You repeat until the sample has the desired number of items.
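Both steps, normalizing the losses into probabilities and sweeping the data until the sample is full, can be sketched in pure Python (the dataset, losses, and sample size below are all made up for illustration):

```python
import random

def loss_proportional_sample(items, losses, sample_size, seed=0):
    # turn per-item losses into selection probabilities by
    # dividing each loss by the sum of all losses
    total = sum(losses)
    probs = [ls / total for ls in losses]

    # sweep the source data repeatedly, selecting each item with
    # its probability, until the sample reaches the desired size
    # (note: an item can be picked in more than one sweep)
    rnd = random.Random(seed)
    sample = []
    while len(sample) < sample_size:
        for item, p in zip(items, probs):
            if len(sample) >= sample_size:
                break
            if rnd.random() < p:
                sample.append(item)
    return sample

# stand-in dataset: every 10th item plays the role of a rare,
# high-loss minority item; the rest are low-loss majority items
items = list(range(1000))
losses = [0.25 if i % 10 == 0 else 0.02 for i in range(1000)]

sample = loss_proportional_sample(items, losses, sample_size=100)
print(len(sample))  # 100
minority = sum(1 for i in sample if i % 10 == 0)
print(minority)     # minority items make up far more than 10% of the sample
```

One practical detail: with the probabilities normalized to sum to 1, each sweep selects roughly one item, so many sweeps are needed; one way to avoid that is to scale the probabilities up (a multiplier like the lambda shown earlier has this effect) so a single pass is expected to yield the desired sample size.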

If you think about it, you'll see that minority class items have higher loss from the crude classifier, and so they'll have a higher probability of being included in the sample. Majority class items are less likely to be selected, but that is balanced by the fact that there are many more of them.

At this point you could use the sample generated by loss-proportional selection to create an ML prediction model using a sophisticated classifier, such as a deep neural network. But if you use the sample directly, the prediction results will not be exactly the same as the results you'd get if you were somehow able to train the sophisticated classifier on the entire source dataset. In situations where you want the sample dataset to give the same prediction results, you can compute an importance weight for each item. The importance weight is just 1 over the selection probability. For example:

sampled data
item  p(select)    w = importance weight
[0]    0.021        1 / 0.021 =  47.62
[1]    0.013        1 / 0.013 =  76.92
[2]    0.018        1 / 0.018 =  55.56     
. . .
[n]    0.016        1 / 0.016 =  62.50

During training, you weight each item by its importance weight. Items that had a low probability of being selected into the sample will have higher importance weights. This balances the fact that low probability items are more rare in the sample than they were in the source dataset.
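A sketch of how the weights enter a training loop, using the selection probabilities from the table above (the predictions on the sampled items are made up):

```python
# selection probabilities for the items that made it into the sample
p_select = [0.021, 0.013, 0.018, 0.016]
weights = [1.0 / p for p in p_select]  # importance weight = 1 / p(select)
print([round(w, 2) for w in weights])  # [47.62, 76.92, 55.56, 62.5]

# during training, each sampled item's loss is scaled by its
# importance weight (shown here with a simple squared-error loss)
preds  = [0.80, 0.40, 0.90, 0.70]  # made-up predictions on the sample
actual = [1,    0,    1,    1]
weighted_loss = sum(w * (p - y) ** 2
                    for w, p, y in zip(weights, preds, actual)) / sum(weights)
print(round(weighted_loss, 4))
```

Many training APIs accept per-item weights directly (for example, a sample-weight argument on a fit routine), so weights computed this way can usually be passed straight through.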

The explanation I’ve given here has taken a few simplification shortcuts to make the ideas easier to understand. For the full story, you can read the research paper.

Very cool idea.

I have always loved cars. I’ve owned over 24 cars, starting with my first one — a 1957 Ford Custom with manual gearshift on the steering column.

Here is a non-random sample of some of the cars I’ve owned — cars where the model year ends in ‘9’. These are not the actual photos of my cars but they’re exactly like the ones I owned — except for the Porsche — I still own it and drive it on sunny weekends.

Top row: 1969 Datsun 510, 1979 Mazda RX-7, 1989 Nissan 240SX. Bottom row: 1989 BMW 5-series, my actual 1999 Porsche Carrera parked next to my house, 1999 BMW 5-series.

Honorable Mention: I drove a red 1979 Triumph TR-7 when I lived in the Newport Beach CA area, but it was actually owned by my girlfriend at the time, Jane Morrison.

This entry was posted in Machine Learning.
