Reservoir Sampling

Consider the problem of grabbing n random items from a set of N items, where you don’t know in advance what N is. For example, suppose you have a huge data set (many millions or billions of items) and you want to select just 1,000 of them. The usual approach is to first scan through the data set to determine N, then use one of several well-known techniques. But, surprisingly, it is possible to select n items uniformly at random, without ever knowing N, using just a single pass through the data. The technique is called reservoir sampling. It’s best explained by example.

Suppose you have a data set which, for simplicity, is stored in an array of size 10 (but we’ll imagine we don’t know the size), and you want to grab n=3 random items. We’ll assume the data array is 0-based and that each cell value is 100 plus its index, so a[0] = 100, a[1] = 101, . . . , a[9] = 109.

The basic reservoir sampling algorithm, in pseudo-code, is:

create a result array of size n
copy the first n items of a[] into result
for each remaining item in a[] (indexed by t)
  pick a random index m between 0 and t inclusive
  if m < n, replace result[m] with a[t]
end loop
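
Here is a minimal runnable sketch of that pseudo-code in Python (the function name reservoir_sample and the choice of random.randint are mine, for illustration; the loop is exactly the one above). The walkthrough below traces the same loop by hand.

import random

def reservoir_sample(stream, n):
    # Return n items sampled uniformly at random from stream,
    # in a single pass, without knowing the total count in advance.
    result = []
    for t, item in enumerate(stream):
        if t < n:
            result.append(item)        # store the first n items in result
        else:
            m = random.randint(0, t)   # random index between 0 and t inclusive
            if m < n:
                result[m] = item       # replace result[m] with the current item
    return result

a = [100 + i for i in range(10)]       # a[0] = 100, a[1] = 101, ..., a[9] = 109
print(reservoir_sample(a, 3))          # three items chosen uniformly at random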

So, initially, result = [100, 101, 102] and t is 3.

The first time through the loop, m is a random value between 0 and 3 inclusive; suppose it’s 1. Because m=1 is less than n=3, we replace result[1] with a[3], and result is now [100, 103, 102].

The second time through the loop, t = 4 and m is a random value between 0 and 4; suppose it’s 2. Because m=2 is less than n=3, we replace result[2] with a[4], and result is now [100, 103, 104].

The third time through the loop, t = 5 and m is a random value between 0 and 5; suppose it’s 4. Because m=4 is not less than n=3, result does not change.

The loop continues in this way until t = 9, where m is a random value between 0 and 9; suppose it’s 7. Because m=7 is not less than n=3, result does not change.

After the loop finishes, the values in result are a uniform random sample: each of the N items has the same probability, n/N, of ending up in result. The basic idea is that each time through the loop, the probability that the current item is placed into result gets smaller and smaller, but this is counterbalanced by the fact that early items are subjected to many more chances for replacement. It’s not an entirely obvious algorithm, but it’s pretty cool.
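
One quick way to convince yourself that the sample really is uniform is a simulation (my addition, reusing the reservoir_sample sketch and the array a from above): with N = 10 and n = 3, each item should appear in the result about 30% of the time.

from collections import Counter

counts = Counter()
trials = 100000
for _ in range(trials):
    counts.update(reservoir_sample(a, 3))

for item in sorted(counts):
    print(item, counts[item] / trials)   # each frequency should be close to 3/10 = 0.30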
