A Clustering Algorithm that Didn’t Work

I was talking to a sales guy from the well-known UBM media company (InformationWeek, Interop Conference). Young guy, very sharp. During our rambling conversation, at one point we chatted about the difference between engineers (like me) and sales people (like him). One difference is that engineers have good control over their outcomes — we write code and it works or it doesn’t. Salesmen can do everything right but still fail to close a deal for reasons they can’t control.

Anyway, I failed the other day when looking at a new way to cluster (group) categorical (non-numeric) data. The key to my idea is a metric called category utility (CU). It is a very clever way to give a numeric value that describes how good a particular clustering of non-numeric data is. The higher the CU, the better the clustering.

So, to cluster non-numeric data, you can try different clusterings and pick the one that gives the highest CU. Unfortunately, the number of possible clusterings for even a moderately sized dataset is unimaginably huge. For example, if you have just 100 items and you want to group them into 20 clusters, there are roughly 20^100 different ways to do this — a number far larger than the estimated number of atoms in the Universe — yes, the number of atoms.

I figured I’d experiment with an evolutionary approach. In pseudo-code:

create of population of possible solutions
loop max_generations times
  pick a good item
  pick another item
  combine the items to make a
    child item
  replace worst item in population
    with the new child
  randomly mutate an item in the
    population
end_loop
return best item found

Evolutionary algorithms like this can be remarkably effective, so I was optimistic my idea would work great.

It sort of worked but not very well.

The moral of the story is that even in fields like computer science, where you can always succeed given enough time and resources, sometimes you fail. My UBM sales friend told me that in his world failure happens all the time but you pick yourself up and learn what you can.



“Exploring Mars” (1953) – Chesley Bonestell

Advertisements
This entry was posted in Machine Learning. Bookmark the permalink.

One Response to A Clustering Algorithm that Didn’t Work

  1. PGT-ART says:

    Another time series problem, counting raindrops https://www.youtube.com/watch?v=tIJQkR-ofFo
    Any ideas about what kind of neural network would be able to count falling raindrops ?
    IT could aply to many other scenarios as wel ea counting people counting cars

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s