Frequent Item-Sets for Association Rule Learning

I wrote an article titled, “Frequent Item-Sets for Association Rule Learning” in the January 2014 issue of MSDN Magazine. See The idea of frequent item-sets is sort of paradoxical in the sense that the idea is conceptually simple, but is difficult to explain exactly what the problem is. Let me try.

The classic example starts with a collection of a supermarket’s customer transactions. For example, one transaction might be (apples, bread, donuts). Each transaction consists of items from the supermarket’s inventory. An item-set is a possible transaction. For example, an item-set could be (bread, celery, icing) — even if no customer ever bought these three items together, in other words, (bread, celery, icing) is an item-set even if there is no (bread, celery, icing) transaction.

The problem is to find frequent item-sets, that is, those item-sets that occur frequently (for example, in more than 10% of all transactions) in the collection of actual transactions. If you know what the frequent item-sets are, this information can be used in several ways, such as placing the items in the frequent item-sets near each other, or by adjusting the prices of those frequent items.

As it turns out, harvesting frequent item-sets from a collection of transactions is an extremely difficult problem. A brute-force approach would be to generate all hypothetical item-sets, and then check each one to see how many times the item-set occurs in the actual transactions. This approach will not work because of the combinatorial explosion problem. For example, if a supermarket had 10,000 items (many supermarkets have much larger inventory), and even if you restrict potential frequent item-sets to just 9 items in size, there are Choose(10,000, 9) = 2,745,826,321,280,434,929,668,521,390,000 possible frequent item-sets to analyze.

My MSDN Magazine article shows an approach which uses something called the Apriori algorithm to extract frequent item-sets from a list of transactions.


This entry was posted in Machine Learning. Bookmark the permalink.