## Classification, Clustering, and Rule Set Extraction

I’ve been working on a set of related programming projects over the past couple of weeks. Classification, cluster analysis, and rule set extraction are closely related topics. Suppose you have a set of data points (also called vectors or tuples) of some sort. These data points could be numeric abstractions such as geometric points, like (0, 3, -1), or they might be rows of a SQL database like (Smith, Stan, \$21.33, Developer).

Now suppose you have a set of known categories, such as c0 = "likely to vote Democratic", c1 = "likely to vote Republican", and so on. Programmatic classification is the process of assigning each data point to one of those categories. Programmatic clustering is similar to classification except that you don’t have known categories; instead, data points that resemble each other are grouped together into clusters. Both classification and clustering can be supervised or unsupervised. With a supervised approach, a preliminary set of training data points is manually classified or clustered, and then this information is used to classify or cluster additional new data points.

There is a huge body of research on classification and cluster analysis. However, the majority of this research deals with purely numerical data such as (3.0, 5.0, 2.0). There is much less research on categorical data such as (red, small, hot). The main reason is that most classification and clustering algorithms rely on some form of difference function, which takes two data points and returns a number that says how far apart they are. It’s not too hard to compute a number that represents the difference between (2.0, 3.0, 4.0) and (1.0, 3.5, 2.7), but it’s a harder problem to determine the difference between (red, small, hot) and (blue, large, cold). Anyway, I’ve found what I believe to be some very cool new ways to perform classification and clustering of categorical data.
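
To make the difference-function point concrete, here is a minimal Python sketch. It is not one of the new techniques mentioned above. It uses two textbook measures that I am assuming purely for illustration: Euclidean distance for the numeric tuples, and a simple matching distance (the fraction of positions that disagree) for the categorical tuples. The function names are made up for this example.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("tuples must be non-empty and the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def simple_matching_distance(a, b):
    """Fraction of positions at which two categorical tuples disagree."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("tuples must be non-empty and the same length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


# Numeric data: the difference is a natural, fine-grained number.
numeric = euclidean_distance((2.0, 3.0, 4.0), (1.0, 3.5, 2.7))

# Categorical data: all we can say per position is "same" or "different".
categorical = simple_matching_distance(
    ("red", "small", "hot"), ("blue", "large", "cold")
)
partial = simple_matching_distance(
    ("red", "small", "hot"), ("red", "large", "hot")
)

print(numeric, categorical, partial)
```

The categorical measure shows why the problem is harder: every mismatch counts the same, so the function has no way of knowing whether red is "closer" to blue than small is to large. Anything smarter needs extra information about the values themselves.
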
Rule set extraction enters the mix at this point: after clustering your data, how can you extract a set of if...then rules that correspond to the clustering result? Again, I’m working on some ideas that really fascinate me.
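
As a rough idea of what extracting rules from a clustering result can look like, here is a deliberately naive Python sketch. It is not one of the ideas mentioned above. It only finds single-attribute rules: if every data point with a given attribute value landed in the same cluster, it emits a rule for that value. The sample rows, the cluster labels, the attribute names, and the function name are all invented for illustration.

```python
from collections import defaultdict


def extract_single_attribute_rules(rows, labels, attribute_names):
    """Return sorted (attribute, value, cluster) triples for every
    attribute value that appears in exactly one cluster."""
    if len(rows) != len(labels):
        raise ValueError("need exactly one cluster label per row")

    # Map (attribute index, value) -> set of clusters where it was seen.
    clusters_seen = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            clusters_seen[(i, value)].add(label)

    rules = []
    for (i, value), clusters in clusters_seen.items():
        if len(clusters) == 1:  # the value points to a single cluster
            rules.append((attribute_names[i], value, next(iter(clusters))))
    return sorted(rules)


# Hypothetical clustering result: four categorical tuples, two clusters.
rows = [
    ("red", "small", "hot"),
    ("red", "large", "hot"),
    ("blue", "large", "cold"),
    ("blue", "small", "cold"),
]
labels = [0, 0, 1, 1]
names = ("color", "size", "temperature")

rules = extract_single_attribute_rules(rows, labels, names)
for attribute, value, cluster in rules:
    print(f"IF {attribute} = {value} THEN cluster {cluster}")
```

On this toy data the sketch produces rules for color and temperature but none for size, because both size values show up in both clusters. Real data rarely splits this cleanly, so a practical approach would also need rules that combine attributes and tolerate exceptions.
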