Selecting n Distinct Array Indices

When I’m writing machine learning code, a common task I need to perform is to generate random array indices. For example, if I have an array with 10 cells, I might want to generate 3 random indices, such as 4, 6, 5. In other words, for an array with length k, the indices are 0 through k-1, and I need to generate n indices that are all different.

RandomArrayIndices

There are several ways to do this. One approach is a brute force method:

loop
  pick n random values between 0 and k-1
  if all values different
    return the values
end loop
return null

The brute force approach is very effective when n, the number of indices to generate, is much smaller than k, the array length, because the chance of getting duplicate values is very small. The problem is that there’s no guarantee you’ll ever get n different values.

A second approach is to use the Reservoir algorithm. It’s quite tricky and I described it in an earlier blog post. The Reservoir algorithm is guaranteed to succeed, but the downside is you have to iterate k times. If the indices-generating routine must be called many times, as is often the case with ML code, you can get bad performance.

So, the approach I sometimes use is this:

set trial := 0
set maxTrials := k * 2
loop maxTrials
  pick n random values between 0 and k-1
  if all values different
    return the values
  trial := trial + 1
end loop
if didn't get good values
  return Reservoir(n, k)

In other words, I first try the quick, but not guaranteed brute force approach, and then if that fails, I resort to the slower, but guaranteed, Reservoir algorithm. Here’s a C# implementation.

private static int[] RandomIndices(int n,
  int range, int seed)
{
  Random rnd = new Random(seed);
  int maxAttempts = range * 2;
  int[] result;
  result = new int[n];
  for (int i = 0; i < maxAttempts; ++i)
  {
    for (int j = 0; j < n; ++j)
      result[j] = rnd.Next(0, range);

    Array.Sort(result);
    if (HasDups(result) == false)
      return result;
  }
  // OK, use Reservoir
  for (int i = 0; i < n; ++i)
    result[i] = i;

  for (int t = n; t < range; ++t)
  {
    int m = rnd.Next(0, t + 1);
    if (m < n)
      result[m] = t;
  }
  Array.Sort(result);
  return result;
}

private static bool HasDups(int[] arr)
{
  for (int i = 0; i < arr.Length-1; ++i)
    if (arr[i] == arr[i + 1])
      return true;
  return false;
}

One possible enhancement I’ve considered is to first check if n and k (“range” in the code) are “close to each other”, and if so, go directly to the Reservoir part of the code because the brute force approach will likely fail.

Advertisements
This entry was posted in Machine Learning. Bookmark the permalink.