I came up with a novel way to measure the distance between two non-numeric items. Let me explain.
Suppose you have two numeric items like (5.0, 2.0, 3.0) and (4.0, 0.0, 3.0). You can easily calculate a distance between the items. For example, the Euclidean distance is sqrt((5-4)^2 + (2-0)^2 + (3-3)^2) = sqrt(5) = 2.24.
But if the items are non-numeric, calculating a meaningful distance is very hard. Suppose v1 = (red, short, heavy) and v2 = (blue, medium, heavy). You could say that the two items differ in two spots so their distance is 2.0. But this doesn’t take into account that there could be a dozens colors (red, blue, green, . . . ) and only three lengths (short, medium, long). A full discussion of the difficulties of measuring distance between non-numeric items would takes pages. Just trust me, it’s an extremely difficult problem.
My novel idea uses what’s called category utility (CU). CU is a metric I ran across in an obscure research paper several years ago. CU measures the goodness of a clustering of a dataset of non-numeric data. CU works by computing a measure of theoretic information gain. Larger CU values mean more information gain due to the clustering.
So my idea is this. If you want to measure the distance between item v1 and v2 that belong to a non-numeric dataset, calculate a CU1 for a clustering that has item v1 in a cluster by itself (and all the other items in a second cluster). Then calculate CU2 for a clustering that has item v2 in a cluster by itself (and all the other items in a second cluster). The difference in CU values is a measure of the difference in information gain, and can be used as a distance.
I coded up a demo, and somewhat to my surprise, the idea seems to work very well. This could be an important result and I wish I had time to explore it.