Converting Non-Numeric or Mixed Data to Strictly Numeric Data

Like many topics in machine learning, this idea is a bit tricky to explain, so bear with me. My original problem was data clustering. Most standard clustering techniques, k-means in particular, require the source data to be completely numeric because you must compute a distance between data items (usually Euclidean distance).

But what if your data has some non-numeric values? For example, imagine some fake flower data:

blue  5.1  3.5  1.4  0.2
pink  4.9  3.0  1.4  0.4
teal  4.7  3.2  1.3  0.3

The first value is the color of the flower, and the next four values are sepal length, sepal width, petal length, and petal width. How do you deal with the color variable if you want to cluster the dataset? If you’re not familiar with clustering, you’d think this would be easy, but trust me, it’s not.
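To make the difficulty concrete, here is a minimal sketch in Python that computes the Euclidean distance between the first two fake flowers using only their four numeric values. The function name euclidean and the way the rows are stored are my own choices for illustration, not part of any library.

```python
import math

# The first two rows of the fake flower data, with the color split off.
blue_flower = ("blue", [5.1, 3.5, 1.4, 0.2])
pink_flower = ("pink", [4.9, 3.0, 1.4, 0.4])

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# This works because every value is a number.
d = euclidean(blue_flower[1], pink_flower[1])
print(f"distance on the four numeric values: {d:.4f}")

# There is no equivalent for the colors: "blue" - "pink" is not defined,
# so the color column has to be turned into numbers somehow first.
```

The numeric columns are no problem, but the color column has no natural notion of subtraction, which is exactly what a distance calculation needs.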

My idea is to convert mixed data into strictly numeric data by using a neural autoencoder. I ran a little experiment where I first encoded the color as blue = (1, 0), pink = (0, 1), teal = (-1, -1). Then I created a 6-10-8-10-6 autoencoder that accepts the six input values (two for the color plus the four measurements) and predicts those same values. After training, the values of the 8 nodes in the central hidden layer are a strictly numeric representation of each flower . . .
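Here is a rough sketch of the idea in Python with NumPy. It is not the code from my experiment. The color encoding and the 6-10-8-10-6 shape come from the description above, but everything else is an assumption made just to get something runnable: tanh hidden activations, a linear output layer, mean squared error, plain full-batch gradient descent, the learning rate, the number of steps, and all of the names (COLOR_CODES, encode, Autoencoder, embed). In practice you would probably also normalize the numeric columns first.

```python
import numpy as np

# Color encoding from the text: two values per color.
COLOR_CODES = {"blue": (1.0, 0.0), "pink": (0.0, 1.0), "teal": (-1.0, -1.0)}

def encode(row):
    """Turn (color, sepal_len, sepal_wid, petal_len, petal_wid) into 6 floats."""
    color, *measurements = row
    return list(COLOR_CODES[color]) + [float(v) for v in measurements]

class Autoencoder:
    """6-10-8-10-6 network (assumed: tanh hidden layers, linear output, MSE)."""

    def __init__(self, sizes=(6, 10, 8, 10, 6), seed=0):
        rng = np.random.default_rng(seed)
        self.W = [rng.normal(0.0, 0.3, (m, n))
                  for m, n in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(n) for n in sizes[1:]]

    def forward(self, X):
        """Return the activations of every layer, starting with the input."""
        acts = [X]
        last = len(self.W) - 1
        for i, (W, b) in enumerate(zip(self.W, self.b)):
            z = acts[-1] @ W + b
            acts.append(z if i == last else np.tanh(z))
        return acts

    def train_step(self, X, lr=0.05):
        """One gradient descent step; returns the MSE measured before the update."""
        acts = self.forward(X)
        err = acts[-1] - X                 # the target is the input itself
        loss = float(np.mean(err ** 2))
        delta = 2.0 * err / err.size       # gradient of MSE at the linear output
        for i in reversed(range(len(self.W))):
            grad_W = acts[i].T @ delta
            grad_b = delta.sum(axis=0)
            if i > 0:
                # Back through the tanh of the previous layer (old weights).
                delta = (delta @ self.W[i].T) * (1.0 - acts[i] ** 2)
            self.W[i] -= lr * grad_W
            self.b[i] -= lr * grad_b
        return loss

    def embed(self, X):
        """Values of the 8 central hidden nodes for each row of X."""
        return self.forward(X)[2]

rows = [("blue", 5.1, 3.5, 1.4, 0.2),
        ("pink", 4.9, 3.0, 1.4, 0.4),
        ("teal", 4.7, 3.2, 1.3, 0.3)]
X = np.array([encode(r) for r in rows])

net = Autoencoder()
losses = [net.train_step(X) for _ in range(500)]
embedding = net.embed(X)   # shape (3, 8): one strictly numeric vector per flower
print(embedding.shape, losses[0], losses[-1])
```

The rows of embedding are what you would hand to k-means instead of the original mixed data. With only three data items this is a toy, of course, and whether the resulting vectors cluster in a meaningful way is the open question.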

At least in theory. My scheme is loosely based on word embeddings, where words are converted to numeric vectors in a roughly similar way.

Anyway, there’s a lot going on here and I’ve found almost zero existing research or practical information along these lines. One of the problems when working with clustering is that a good clustering is very hard to define precisely. But I’ll keep probing away at this problem.

“Good” art is impossible to define precisely. Five examples from artists whose work is often described as kitsch/bad. Beauty is in the eye of the beholder but I think all five paintings are wonderful. “The Green Lady”, Vladimir Tretchikoff. “Lamplight Manor”, Thomas Kinkade. “Tina”, JH Lynch. “Dancers in the Koutoubia Palace”, LeRoy Neiman. “Gypsy Girl”, Charles Roka.
