Uncertainty sampling is a technique for selecting a good subset of training data from a very large dataset. The problem scenario isn’t very common. In most cases you want more training data, not less. But there are times when you want a sample of training data.
Suppose you have a huge dataset to use to train a complex neural network. What can you do if your dataset is just too large to deal with? Taking a random sample of data items usually doesn’t work very well, and taking a stratified sample is very difficult (and it doesn’t work very well either). An old approach that is often quite effective is called “uncertainty sampling”.
Expressed as very high-level pseudo-code:
select a small random sample of data
train an initial model
loop until satisfied
  use model to find a set of uncertain data items
  add uncertain items to data sample
  use new sample to train a new model
end-loop
In words: start with a small subset of the data and use it to create an initial model. Then repeatedly use the current model to make predictions on unused data, looking for the data items that are most difficult to predict. Add these “uncertain” data items to the sample of training data, then use the new sample to create a new model.
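The loop above can be sketched in Python. Everything here is a placeholder assumption just to make the loop concrete: the synthetic dataset, the tiny sizes, the nearest-centroid “model” standing in for a real neural network, and the entropy-based uncertainty score.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic 3-class data standing in for the huge labeled dataset
# (sizes shrunk drastically just for illustration)
X = rng.normal(size=(2000, 4))
y = rng.integers(0, 3, size=2000)

def train(X_s, y_s):
    # placeholder "model": per-class mean vectors (nearest-centroid)
    return {c: X_s[y_s == c].mean(axis=0) for c in np.unique(y_s)}

def predict_proba(model, X_q):
    # softmax over negative distances to each class centroid
    d = np.stack([np.linalg.norm(X_q - m, axis=1) for m in model.values()], axis=1)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

# select a small random sample of data, train an initial model
pool = np.arange(len(X))
rng.shuffle(pool)
sample, pool = list(pool[:100]), list(pool[100:])
model = train(X[sample], y[sample])

# loop: score a chunk of unused data, pull in the most uncertain items
for _ in range(5):
    chunk = pool[:500]
    probs = predict_proba(model, X[chunk])
    # entropy of the predicted probabilities -- higher = more uncertain
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    picked = np.argsort(entropy)[-50:]          # 50 most uncertain items
    picked_set = set(chunk[i] for i in picked)
    sample += list(picked_set)
    pool = [i for i in pool if i not in picked_set]
    model = train(X[sample], y[sample])         # retrain on the new sample

print(len(sample))  # 100 + 5*50 = 350
```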
The idea is relatively simple. Here’s a concrete example.
Suppose you have a training dataset that has 100,000,000,000 labeled items, but your computing resources can realistically only deal with 100,000 training items. The ultimate goal is a classifier with k = 3 classes. A naive approach to reduce the size of the data is to just randomly select 100,000 items as a sample and use those to train your model. Naive sampling usually doesn’t work very well.
Instead, you start with 10,000 randomly selected items and use them to create an initial prediction model. Then you feed about 1,000,000 of the remaining items to the initial model and compute predictions. This process is fast because you’re not doing any training. From the 1,000,000 predictions, you select the 1,000 items that have the most uncertainty in their predictions. For example, a model output of (0.34, 0.35, 0.31) is much more uncertain than an output of (0.05, 0.92, 0.03). You add the 1,000 uncertain items to the initial sampled training data. Now you have 11,000 training items. You repeat this process, adding 1,000 new data items to the sampled training data on each iteration.
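One common way to score uncertainty is the entropy of the model’s output probabilities (a least-confidence or margin score works too — this choice is an assumption, not something the technique dictates). A minimal sketch using the two example outputs above:

```python
import numpy as np

def entropy(p):
    # Shannon entropy of a predicted probability vector;
    # larger values mean the model is less sure of its prediction
    p = np.asarray(p, dtype=np.float64)
    return float(-(p * np.log(p)).sum())

uncertain = entropy([0.34, 0.35, 0.31])   # close to log(3) = 1.0986, the maximum
confident = entropy([0.05, 0.92, 0.03])   # much smaller
print(uncertain > confident)              # True
```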
The key idea is that by selecting data items that the model is uncertain about, you are adding the most information to the system. Put another way, you’re selecting the data items that will do the most good. Random selection tends to pick data items that are similar to each other, and so the selected items won’t add much value during training.
Uncertainty sampling is a meta-heuristic, not a rigid algorithm, and so there are endless variations possible. Uncertainty sampling can also be used with sparse-class data. Suppose your training data is 99% class 0 items (“no cyber attack”) and 1% class 1 items (“cyber attack”). You can use all the rare class 1 items, and then use uncertainty sampling to select an appropriate number of the common class 0 items.
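The sparse-class variation can be sketched like this. The 10,000-item pool, the 4-to-1 ratio of common to rare items in the final sample, and the random stand-in uncertainty scores are all illustrative assumptions — in practice the scores would come from a trained model’s predictions, as above.

```python
import numpy as np

rng = np.random.default_rng(1)

# imbalanced pool: 99% class 0 ("no cyber attack"), 1% class 1 ("cyber attack")
y = np.zeros(10_000, dtype=int)
y[rng.choice(10_000, size=100, replace=False)] = 1

rare_idx = np.where(y == 1)[0]      # keep every rare class 1 item
common_idx = np.where(y == 0)[0]

# stand-in uncertainty scores for the common items; in practice these
# come from a trained model's predictions (e.g., prediction entropy)
scores = rng.random(len(common_idx))

# take, say, 4x as many common items as rare ones, most uncertain first
n_common = 4 * len(rare_idx)
picked_common = common_idx[np.argsort(scores)[-n_common:]]

sample = np.concatenate([rare_idx, picked_common])
print(len(sample))  # 100 rare + 400 common = 500
```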
A good introductory reference is https://storm.cis.fordham.edu/~gweiss/selected-papers/lewis94.pdf. I recently learned that a colleague of mine, Dr. Paul Mineiro, is an expert in this area of research. I haven’t looked at any of his work yet but I’ll do so when I get some free time.
Uncertainty sampling from a set of data is closely related to the idea of “coresets” — which loosely means getting an optimal (in some sense) sample. Uncertainty sampling is also related to the idea of dataset similarity — the idea that a sample should be similar (in some sense) to the larger dataset from which the sample is drawn. Dataset similarity is surprisingly tricky. I have a vague thought that it might be possible to measure dataset similarity based on expected calibration error, but that’s a topic for a future post.
The Heisenberg uncertainty principle says that there’s a limit on the accuracy with which you can measure related properties of a particle, such as its position and momentum. The uncertainty principle is closely related to, but different from, “the observer effect” which says you can’t observe something without changing it. Left: by artist Ed Fairburn. Center: by cartoonist John Chase. Right: by artist Barry Wilson.