I ran into an interesting mini-problem recently. I had a collection of items, where each item had a value and I wanted to select one item in a way where the probability of being selected was proportional to the value.

Suppose there are 4 items (with IDs 0, 1, 2, 3) and the values of the four items are 20, 40, 10, 30. The sum of the four values is 20 + 40 + 10 + 30 = 100. I want to select item 0 with probability = 20/100 = 0.2, and item 1 with probability 40/100 = 0.4, etc. In other words, items with higher values have a higher likelihood of being selected.

This mini-problem occurs in genetic algorithms where the items are genetic individuals and the values are their fitness. Individuals with higher fitness values are selected for breeding. The mini-problem, sometimes called fitness proportionate selection, occurs inside other algorithms too.

There are several ways to perform fitness proportionate selection. When I implement genetic algorithms, I almost always use a technique called tournament selection because it is very easy to code.

But I was looking at a different problem — k-means++ initial means selection. Here the items are data tuples and their values are squared-distance to the closest existing mean. I wanted to select a data item that has a high distance-squared value as the next initial mean, because you want the means to be different from each other.

Anyway, for the k-means++ initial means mini-problem, a technique called roulette wheel selection was nicer than tournament selection. The reason why roulette wheel is better is complicated, but basically, roulette wheel allowed me to easily avoid selecting duplicate items.

The core roulette wheel code looks like this:

// pick a data item, using the squared-distances // this is a form of roulette wheel selection double p = rnd.NextDouble(); double sum = 0.0; // sum of distances-squared for (int i = 0; i less-than distSquared.Length; ++i) sum += distSquared[i]; double cumulative = 0.0; // cumulative prob int ii = 0; // points into distSquared[] int sanity = 0; // sanity count while (sanity less-than data.Length * 10) { cumulative += distSquared[ii] / sum; if (cumulative greater-than p && used.Contains(ii) == false) { newMean = ii; // the chosen index used.Add(newMean); // don't pick again break; } ++ii; // next candidate if (ii greater-than-or-equal distSquared.Length) ii = 0; // back to first item ++sanity; } // save the data of the chosen index Array.Copy(data[newMean], means[k], data[newMean].Length);

The code above has lots of missing pieces because the complete code is quite long.