Converting Numeric Data to Categorical Data

In the August 2013 issue of MSDN Magazine, I explain how to convert numeric data into categorical data. Machine learning often deals with two kinds of data: numeric data, such as a person’s height in inches, and categorical data, such as a person’s eye color. Some machine learning algorithms work only with numeric data. Logistic regression is one example. And some machine learning algorithms work only with categorical data. Naive Bayes classification, in its basic form, is one example.

So, in situations where you want to use a machine learning algorithm that works only with categorical data, but some or all of your raw data is numeric, you may want to convert the numeric data to categorical data. For example, suppose your raw data is people’s heights measured in inches and you want to use Naive Bayes classification. You can convert raw height values such as 68.0 and 79.0 to categorical data such as “medium” and “tall” (or, equivalently, “1” and “2”, where “0” means short, “1” means medium, and “2” means tall).
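To make the height example concrete, here is a minimal Python sketch. The cut points of 66.0 and 72.0 inches and the function name height_to_category are my own assumptions for illustration; they do not come from the article, and in practice you would choose boundaries that fit your data.

```python
from bisect import bisect_right

# Hypothetical cut points in inches (assumed for illustration only).
# Below 66.0 is "short", 66.0 up to (not including) 72.0 is "medium",
# and 72.0 or more is "tall".
CUTOFFS = [66.0, 72.0]
LABELS = ["short", "medium", "tall"]

def height_to_category(height):
    """Return (code, label) for a height given in inches."""
    # bisect_right counts how many cut points are <= height,
    # which is exactly the 0/1/2 category code.
    code = bisect_right(CUTOFFS, height)
    return code, LABELS[code]

for h in [68.0, 79.0]:
    print(h, height_to_category(h))
# prints:
# 68.0 (1, 'medium')
# 79.0 (2, 'tall')
```

The integer code and the string label carry the same information, so you can use whichever form your classifier expects.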

There are several ways to convert numeric data to categorical data. In the MSDN Magazine article I describe a relatively sophisticated technique based on the k-means clustering algorithm. The idea is to group similar data items together and then use each item’s group ID as its category label. The article also briefly discusses two simpler alternatives: dividing the range of the numeric data into equally spaced intervals, and dividing the data into groups that each hold roughly the same number of items (equal frequencies). By the way, the “other-direction” problem, that is, converting categorical data into numeric data, is also important. That process is often called data encoding.
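The three approaches mentioned above (equally spaced intervals, equal frequencies, and k-means clustering) can be sketched in a few lines of Python. This is a simplified illustration and not the code from the article: the function names, the sample heights, the choice of k = 3, and the way the k-means centers are initialized are all my own assumptions.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equally spaced intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # all values identical
    # The maximum value would land in bin k, so clamp it to k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign values to k bins holding roughly the same number of items."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        # Note: ranking by position means tied values can end up
        # in different bins in this simple version.
        bins[i] = rank * k // n
    return bins

def kmeans_1d_bins(values, k, iterations=20):
    """Group one-dimensional values with a bare-bones k-means loop."""
    lo, hi = min(values), max(values)
    # Assumed initialization: centers spread evenly across the range.
    centers = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j]))
                  for v in values]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

# Hypothetical heights in inches, made up for this example.
heights = [60.0, 62.0, 63.0, 64.0, 70.0, 80.0]
print(equal_width_bins(heights, 3))      # [0, 0, 0, 0, 1, 2]
print(equal_frequency_bins(heights, 3))  # [0, 0, 1, 1, 2, 2]
print(kmeans_1d_bins(heights, 3))        # [0, 0, 0, 0, 1, 2]
```

Notice that equal-frequency binning splits 62.0 and 63.0 into different categories even though they are close together, while the other two approaches keep the four shortest heights in one group. Which behavior is preferable depends on your data.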


