I was talking to a sales guy from the well-known UBM media company (InformationWeek, Interop Conference). Young guy, very sharp. During our rambling conversation, at one point we chatted about the difference between engineers (like me) and sales people (like him). One difference is that engineers have good control over their outcomes — we write code and it works or it doesn’t. Salesmen can do everything right but still fail to close a deal for reasons they can’t control.
Anyway, I failed the other day when looking at a new way to cluster (group) categorical (non-numeric) data. The key to my idea is a metric called category utility (CU). It is a very clever way to give a numeric value that describes how good a particular clustering of non-numeric data is. The higher the CU, the better the clustering.
So, to cluster non-numeric data, you can try different clusterings and pick the one that gives the highest CU. Unfortunately, the number of possible clusterings for even a moderately sized dataset is unimaginably huge. For example, if you have just 100 items and you want to group them into 20 clusters, there are roughly 20^100 different ways to do this — a number far larger than the estimated number of atoms in the Universe — yes, the number of atoms.
I figured I’d experiment with an evolutionary approach. In pseudo-code:
create of population of possible solutions loop max_generations times pick a good item pick another item combine the items to make a child item replace worst item in population with the new child randomly mutate an item in the population end_loop return best item found
Evolutionary algorithms like this can be remarkably effective, so I was optimistic my idea would work great.
It sort of worked but not very well.
The moral of the story is that even in fields like computer science, where you can always succeed given enough time and resources, sometimes you fail. My UBM sales friend told me that in his world failure happens all the time but you pick yourself up and learn what you can.