Researchers Explore Intelligent Sampling of Huge ML Datasets to Reduce Costs and Maintain Model Fairness

I contributed to an article titled “Researchers Explore Intelligent Sampling of Huge ML Datasets to Reduce Costs and Maintain Model Fairness” in the May 2021 edition of the online Pure AI site. See https://pureai.com/articles/2021/05/03/intelligent-ai-sampling.aspx.

Researchers devised a new technique to select an intelligent sample from a huge file of machine learning training data. The technique is called loss-proportional sampling. Briefly, a preliminary, crude prediction model is created using all source data, then loss (prediction error) information from the crude model is used to select a sample that is superior to a randomly selected sample.

The researchers demonstrated that using an intelligent sample of training data can produce prediction models that remain fair. Additionally, the smaller sample size enables quicker model training, which reduces the electrical energy required and, in turn, the CO2 emissions generated while training the model.

The ideas are best explained by an artificial example. Suppose you want to create a sophisticated machine learning model that predicts the creditworthiness of a loan applicant: 0 = reject application, 1 = approve application. You have an enormous file of training data, perhaps billions of historical data items. Each data item has predictor variables such as applicant age, sex, race, income, debt, savings and so on, plus a class label indicating whether the loan was repaid: 0 = failed to repay, 1 = successfully repaid.
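
To make the later steps concrete, here is a minimal sketch that builds a purely synthetic stand-in for such a source dataset using NumPy. The array names X and y, the number of items N, and all values are illustrative assumptions, not data from the article.

import numpy as np

# Synthetic stand-in for a huge source file of loan data (illustrative only).
rng = np.random.default_rng(seed=1)
N = 1_000_000                                  # stand-in for "billions" of items
X = rng.normal(size=(N, 6))                    # predictors: age, income, debt, etc. (normalized)
y = (rng.random(N) < 0.80).astype(np.int64)    # class label: 1 = repaid, 0 = failed to repay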

To create an intelligent loss-proportional sample, you start by creating a crude binary classification model using the entire large source dataset. A common choice for the crude model is logistic regression, one of the simplest binary classification techniques. With modern training methods, fitting a logistic regression model on an enormous data file is almost always feasible, unlike training a more sophisticated model such as a deep neural network.
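
Continuing the synthetic example above, the sketch below fits such a crude model with scikit-learn's SGDClassifier, which trains logistic regression by stochastic gradient descent and therefore scales to very large files (it also supports out-of-core training via partial_fit). The hyperparameter values are illustrative guesses, not settings from the article.

from sklearn.linear_model import SGDClassifier

# Crude model: logistic regression trained by stochastic gradient descent.
# loss="log_loss" selects logistic regression (older scikit-learn versions spell it "log").
crude = SGDClassifier(loss="log_loss", max_iter=5, tol=None, random_state=0)
crude.fit(X, y)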

After you have trained the crude model, you run all items in the large source dataset through the model. This will generate a loss (error) value for each source item, which is a measure of how far the prediction is from the actual class label. For example, the loss information might look like:


(large source dataset)
item  prediction  actual   loss   prob of selection
[0]    0.80         1      0.04   1.5 * 0.04 = 0.06
[1]    0.50         0      0.25   1.5 * 0.25 = 0.38
[2]    0.90         1      0.01   1.5 * 0.01 = 0.02
[3]    0.70         1      0.09   1.5 * 0.09 = 0.14
. . .
[N]    0.85         1      0.02   1.5 * 0.02 = 0.03

Here the loss value is the square of the difference between the prediction and the actual class label, but there are many other loss functions that can be used. In general, the loss values for data items that have rare features will be greater than the loss values for normal data items.
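
Continuing the sketch, the per-item loss values in the table can be computed by running every source item back through the crude model. Squared error is used here to match the table, but again other loss functions could be substituted.

# Run all source items through the crude model and compute a squared-error
# loss per item (the prediction is the predicted probability of class 1).
p = crude.predict_proba(X)[:, 1]
loss = (p - y) ** 2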

Next, you map the loss for each source data item to a selection probability. In the example above, each loss value is multiplied by a constant lambda = 1.5. Now suppose you want a sample that is 10 percent of the size of the large source dataset. You iterate through the source dataset and add each item to the sample with its associated probability. In the example above, item [0] would be selected with prob = 0.06 (likely not selected), item [1] would be selected with prob = 0.38 (more likely) and so on. You repeat until the sample has the desired number of data items.
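
Here is a minimal sketch of this sampling step, continuing the example above. Choosing lambda so that the expected sample size comes out to roughly 10 percent of the source data is an illustrative assumption, not necessarily the paper's exact rule; the paper gives the full procedure, including details not shown here.

# Map each loss to a selection probability and keep each item with that
# probability. lambda is chosen so the expected sample size is about
# 10 percent of the source data (an illustrative choice).
lam = 0.10 * len(loss) / loss.sum()
keep_prob = np.clip(lam * loss, 0.0, 1.0)     # probabilities must lie in [0, 1]
keep = rng.random(len(loss)) < keep_prob      # one Bernoulli draw per item
X_sample, y_sample = X[keep], y[keep]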

The ideas are fully explained in a 2013 research paper titled “Loss-Proportional Subsampling for Subsequent ERM” by P. Mineiro and N. Karampatziakis. The paper is available online in several locations. Note that the title of the paper uses the term “subsampling” rather than “sampling.” This simply means that the large source dataset is itself considered a sample from all possible problem data; therefore, selecting from the source dataset gives a subsample.

In the early days of machine learning, there was often a lack of labeled training data. But, increasingly, machine learning efforts have access to enormous datasets, which makes techniques for intelligent sampling more and more important. The history of computer hardware and software is fascinating.


Left: Nicolas Temese created this beautiful diorama of an IBM 1401 mainframe computer system. The 1401 was introduced in 1959. It was one of the very first mass-produced machines. In the early 1960s there were about 20,000 computers on the entire planet and about half of these were 1401s. Right: This is a 1/12 scale diorama of Tony Stark’s (Iron Man) workshop system. It was created by Sherwyn Lazaga who works for a company called StudioGenesis.

