Anomaly detection is the process of finding unusual data items. One standard approach is to cluster the data and then look at clusters with very few items, or at items that are far away from their cluster mean/average. Unfortunately, in most cases clustering works only with strictly numeric items (there are a few exceptions).
What do you do if your data is non-numeric or mixed numeric and non-numeric? I decided to explore an idea I’d seen in a couple of recent research journals: Create a deep neural autoencoder and then look at those items that the encoder has the most trouble with.
I zapped up a quick demo, and the idea seems to work. I used the UCI Digits Dataset. There are 1797 items. Each item has 64 predictor values which represent an 8×8 handwritten digit. I used PyTorch to create a 64-32-16-4-16-32-64 deep neural autoencoder. An autoencoder is a neural network that predicts its own input values.
One of the key ideas here is that unlike standard clustering, a neural autoencoder can deal with both numeric input and non-numeric input (that is encoded using 1-of-[N-1] or one-hot encoding).
After training the autoencoder, I walked through the 1797 items and found the one that gave the autoencoder the most trouble, meaning the item that gave the highest error between the 64 input pixel values and the 64 predicted pixel values.
It turned out that item  had the highest error. It was a handwritten ‘7’. I displayed item  and sure enough, it looked like an anomaly (meaning it didn’t look very much like a ‘7’).
A different approach (which I didn’t have time to explore) is to use an autoencoder architecture that has 64 nodes in the central hidden layer, then after training, when fed input values, that 64-value vector is a purely numeric representation of the input. Because the representation is numeric, k-means clustering can be applied. Then, after clustering it’s possible to find clusters with very few values, or find values in clusters that are far away from their cluster mean. A lot of ideas there.
Anomaly detection is a very difficult problem, but my experiment suggests that a deep neural autoencoder has good potential for tackling anomaly detection.