Many machine learning algorithms work only on continuous numeric data (such as heights in inches: 67.5, 70.2, and so on) or only on categorical data (such as car color: red, white, and so on). For example, k-means clustering works only with numeric data, but CU (category utility) clustering works only with categorical data. Likewise, logistic regression classification and prediction works only with numeric data, while naive Bayes classification and prediction, in its basic form, works only with categorical data. There are many other examples.
A fundamental technique in machine learning is to convert numeric data into categorical data so that an algorithm that works only with categorical data can be applied to the data under investigation. For example, consider clustering mixed numeric and categorical data. If the numeric columns can be converted into categorical data, the powerful CU clustering algorithm can be applied to the entire data set. By the way, it is possible to go in the other direction and convert categorical data into numeric data using 1-of-n or 1-of-(n-1) encoding, but that’s another story.
Discretization of continuous data isn’t a very flashy topic. Perhaps for that reason, although there is a fairly good amount of research on it, there hasn’t been nearly as much as I’d have guessed given the topic’s fundamental importance.
There are three basic approaches to unsupervised discretization of numeric data (unsupervised meaning that no so-called training data is used). Suppose we have the heights, in inches, of 16 people: 64, 66, 66, 68, 68, 68, 70, 70, 70, 70, 72, 72, 72, 74, 74, 76.
The first approach is to sort the data and assign equal numbers of data points to each category label. For example, if we have n=4 bins, then the first four heights get category label “0”, the second four get “1”, the third four get “2”, and the last four get “3”. So heights 64, 66, 66, 68 map to label “0”, and 68, 68, 70, 70 map to label “1” and . . . oops, you can see that we have a problem, because identical heights (68 here) are mapping to different category labels. Even if we fix this minor but annoying technical problem, the equal-frequency approach doesn’t take into account natural breaks in the data.
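The equal-frequency idea, including the problem just described, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `equal_frequency_labels` and the rank-based assignment rule are hypothetical, not taken from any particular library.

```python
def equal_frequency_labels(values, n_bins):
    """Naive equal-frequency binning: rank the values, then give each
    bin (as nearly as possible) the same number of data points."""
    n = len(values)
    # Indices of the values in ascending order (sorted() is stable).
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = min(rank * n_bins // n, n_bins - 1)
    return labels

heights = [64, 66, 66, 68, 68, 68, 70, 70, 70, 70, 72, 72, 72, 74, 74, 76]
labels = equal_frequency_labels(heights, 4)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

# Find heights that ended up with more than one label.
seen = {}
for h, lab in zip(heights, labels):
    seen.setdefault(h, set()).add(lab)
split = {h: sorted(labs) for h, labs in seen.items() if len(labs) > 1}
print(split)  # {68: [0, 1], 70: [1, 2], 72: [2, 3]}
```

With the 16 example heights, three distinct values (68, 70 and 72) straddle a bin boundary, so any practical version needs a tie-handling rule, such as moving all equal values into the same bin.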
A second approach to unsupervised discretization of numeric data is to create equal intervals. For example, the range of the example data is 76 - 64 = 12 inches. If there are n=4 bins, then each interval is 12 / 4 = 3 inches wide and the intervals are [64, 67), [67, 70), [70, 73), [73, 76], where I’ve used square brackets for inclusive and parentheses for exclusive. This approach also ignores natural breaks in the data. A variation on the equal-interval approach is to bin data according to the data’s standard deviation.
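Here is a minimal Python sketch of equal-interval binning on the same heights. The function name `equal_width_labels` is my own, and clamping the maximum value into the last bin is one reasonable way (an assumption on my part) to make the final interval inclusive on the right.

```python
def equal_width_labels(values, n_bins):
    """Equal-interval binning: split [min, max] into n_bins intervals
    of the same width and label each value by its interval index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything goes in bin 0.
        return [0] * len(values)
    # min(...) puts the maximum value into the last bin, so the last
    # interval is closed on the right.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

heights = [64, 66, 66, 68, 68, 68, 70, 70, 70, 70, 72, 72, 72, 74, 74, 76]
print(equal_width_labels(heights, 4))
# [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3]
```

Identical heights always get the same label here, but the bin counts are uneven (3, 3, 7, 3), and the boundaries fall wherever the arithmetic puts them rather than where the data thins out.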
The third approach to unsupervised discretization of continuous data is to use clustering. The idea is to cluster the numeric data using Euclidean distance. The result will generally take into account natural breaks in the data. Then the category label for each numeric data point is its assigned zero-based cluster number.
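To make the clustering approach concrete, here is a rough sketch that uses a simple one-dimensional k-means, where Euclidean distance reduces to the absolute difference. The choice of k-means, the function name `kmeans_1d_labels`, and the evenly spaced starting centers are all assumptions I’ve made for illustration; other clustering algorithms and initializations would work too.

```python
def kmeans_1d_labels(values, k, max_iter=100):
    """Cluster 1-D values with a basic k-means and return
    (labels, centers). Labels are zero-based cluster numbers."""
    lo, hi = min(values), max(values)
    # Assumed deterministic start: centers spread evenly over the range.
    centers = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign each value to its nearest center (ties go to the
        # lower-numbered cluster).
        new_labels = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        if new_labels == labels:
            break  # Assignments stopped changing.
        labels = new_labels
        # Move each center to the mean of its members; an empty
        # cluster keeps its previous center.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers

heights = [64, 66, 66, 68, 68, 68, 70, 70, 70, 70, 72, 72, 72, 74, 74, 76]
labels, centers = kmeans_1d_labels(heights, 4)
print(labels)
print([round(c, 2) for c in centers])
```

Because a label depends only on which center a value is closest to, identical heights can never be split across categories, which avoids the problem seen with the equal-frequency approach.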
So, this is almost recursive: the primary problem is to cluster mixed categorical and numeric data, and we’d like to use CU clustering, which works only on categorical data. To get there, we first turn each numeric column into categorical data using a clustering algorithm that works only on numeric data.
I’ve experimented with the third approach with pretty good results. An unanswered question is how to determine the optimal number of clusters — both for the preliminary clustering of each numeric data column, and the primary CU clustering of the entire data set. I have written up a detailed explanation of this topic, with source code. It is scheduled to appear in Microsoft’s MSDN Magazine in the July 2013 issue.