A common machine learning task is converting numeric data into categorical data. For example, suppose you have a set of employee annual incomes with values such as $48,000.00 and $39,500.00 and you want to transform them into values such as “low”, “medium”, and “high”. This is called discretization or, informally, binning the data. Binning is done for two main reasons. First, some ML algorithms expect categorical data. Second, binning tends to smooth out differences between data points which is sometimes, but not always, useful.

There are three main ways to bin data. The simplest is to create intervals, or “buckets”, of equal width. The second approach, equal frequency, is to create buckets so that each bucket has an (approximately) equal number of data values in it. The third approach is to create buckets using a clustering algorithm such as k-means so that data values are grouped by similarity to each other. Each of the three techniques has significant pros and cons so there is not one clear best way to bin data.

There are many, many ways to implement equal width binning. Here’s one example. Suppose the data to bin is below and you want 4 bins:

2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0, 12.0, 14.0

I’ve ordered the data, but equal width binning doesn’t require it. First, the min and max values are found:

min = 2.0 max = 14.0

Next the width of each bucket is computed:

width = (max - min) / number-buckets = (14.0 - 2.0) / 4 = 3.0

Next, the intervals are constructed:

[2.0, 5.0) [5.0, 8.0) [8.0, 11.0) [11.0, 14.0) -> data [0] [1] [2] [3] [4] [5] [6] [7] -> index {0} {1} {2} {3} -> bin

Next, the two end intervals are modified to catch outliers:

[-inf, 5.0) [5.0, 8.0) [8.0, 11.0) [11.0, +inf) [0] [1] [2] [3] [4] [5] [6] [7] {0} {1} {2} {3}

Now you can bin. Suppose you have some value x = 9.5. It belongs in bin {2} so it maps to categorical value “2”.

In the approach above, most interval values are stored twice. This isn’t necessary but gives better clarity at the expense of a little bit of extra storage.

In C# code, without error-checking:

static double[] MakeIntervals(double[] data, int numBins) { double max = data[0]; // find min & max double min = data[0]; for (int i = 0; i < data.Length; ++i) { if (data[i] < min) min = data[i]; if (data[i] > max) max = data[i]; } double width = (max - min) / numBins; // compute width double[] intervals = new double[numBins * 2]; // intervals intervals[0] = min; intervals[1] = min + width; for (int i = 2; i < intervals.Length - 1; i += 2) { intervals[i] = intervals[i - 1]; intervals[i + 1] = intervals[i] + width; } intervals[0] = double.MinValue; // outliers intervals[intervals.Length - 1] = double.MaxValue; return intervals; }

And the partner method:

static int Bin(double x, double[] intervals) { for (int i = 0; i < intervals.Length - 1; i += 2) { if (x >= intervals[i] && x < intervals[i + 1]) return i / 2; } return -1; // error }

Calling the methods could resemble:

double[] data = new double[] { 2.0, 3.0, . . , 14.0 }; int numBins = 4; double[] intervals = MakeIntervals(data, numBins); double x = 9.5; int bin = Bin(x, intervals); Console.WriteLine("\nBin of " + x + " is " + bin);

Instead of the static method approach, you could use an OOP approach, putting MakeIntervals code into a constructor and Bin code into a member method. Calling would look like:

double[] data = new double[] { 2.0, 3.0, . . , 14.0 }; int numBins = 4; Binner b = new Binner(data, numBins); double x = 9.5; int bucket = b.Bin(x); Console.WriteLine("\nBin of " + x + " is " + bucket);

Binning code isn’t too difficult but as much as any code I deal with, has an absolutely huge number of options for implementation.