Seven Deep Learning Techniques for Unsupervised Anomaly Detection

The goal of anomaly detection is to examine a set of data to find unusual data items. Three of the main approaches are (1) rule-based techniques, (2) classification techniques that use labeled training data, and (3) unsupervised techniques.

Suppose some source data looks like:

sex   age   city     income   job-type
 M    32    anaheim  $64,500  mgmt
 F    28    boulder  $34,000  sales
 M    40    concord  $71,500  tech
. . .

You want to scan through the data to find items that are different in some way from the others.

A rule based system handcrafts conditions such as, “If the age value is less than 14 and the income value is greater than $1,000,000 then flag item as anomalous.”

A machine learning classifier requires a large set of training data where each item has feature values plus a label (0 = normal, 1 = anomaly). You then train a neural network classifier (or k-NN, or logistic regression, or a Bayesian classifier, or whatever) to create a prediction model.

Unsupervised techniques accept data without labels then build a prediction model from that raw data. Here is a brief description of seven machine learning unsupervised anomaly detection techniques. Five use deep learning techniques, one uses classical machine learning, and one uses reinforcement learning.

1. Autoencoder (AE) reconstruction error. An autoencoder is a deep neural system that learns to predict its own input. You train a model (you don’t need labels). Then you run each data item through the AE which will generate a close copy. For example, an input of (M, 37, anaheim, $54,500, sales) would be encoded and normalized to something like (0, 0.37, 1,0,0, 0.5450, 0,1,0) and the output might be (0, 0.35, 0,1,0, 0.5560, 0,1,0). You measure the difference between the actual input and the reconstruction to get reconstruction error. You do this for each data item. Items that have large reconstruction error are anomalous. This technique is fairly well known.

The autoencoder examined a set of digit images and flagged a ‘7’ as the most anomalous.
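The digit-image result above came from a real trained autoencoder. As a minimal stand-alone sketch of the reconstruction-error idea, the example below swaps the deep autoencoder for its simplest special case, a rank-2 linear encode/decode (equivalent to PCA projection), on made-up data with one planted off-subspace item. The data, dimensions, and planted anomaly are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 100 normal items near a 2-D subspace of 6-D space,
# plus one planted item that lies off that subspace.
z = rng.normal(size=(100, 2))
A = rng.normal(size=(2, 6))
X = z @ A + 0.01 * rng.normal(size=(100, 6))
X = np.vstack([X, 3.0 * rng.normal(size=6)])   # index 100 is the anomaly

# Rank-2 linear "autoencoder": project onto the top-2 principal
# directions (encode) and back (decode), via SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V2 = Vt[:2].T                 # encoder weights: 6 -> 2
recon = (Xc @ V2) @ V2.T      # decode back to 6-D

# Per-item reconstruction error; the largest error is the most anomalous.
scores = np.mean((Xc - recon) ** 2, axis=1)
most_anomalous = int(np.argmax(scores))
```

A real deep autoencoder replaces the single linear projection with trained nonlinear encode and decode networks, but the scoring step is the same: compare each item with its reconstruction.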

2. Isolation forest (IF) sampling anomaly score. You repeatedly take a sample of data, then construct a binary tree data structure. Anomalous data items will be separated quickly and be close to the tree root node. The average distance of a data item from the root (averaged over many samples) is an anomaly score; small values are more likely to be anomalous. This technique has some technical limitations but is fast and can handle huge datasets.

Example of an isolation forest examining dummy random data.
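A similar toy experiment can be run with the scikit-learn IsolationForest implementation. The data here is made up, with one obvious planted outlier:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 4))   # 200 ordinary items
X = np.vstack([X, np.full(4, 8.0)])       # index 200 is an obvious outlier

forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(X)

# Lower decision_function values correspond to shorter average isolation
# paths, i.e., more anomalous items.
scores = forest.decision_function(X)
most_anomalous = int(np.argmin(scores))
```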

3. Variational autoencoder (VAE) reconstruction probability. A variational autoencoder is similar to a regular AE, but a VAE learns a probability distribution instead of learning a fixed math equation. You use all data items (normalized and encoded) to create a VAE. Then you feed each data item to the VAE and compute a reconstruction probability, which is the likelihood that the data item came from the distribution learned by the VAE. Low probability values indicate anomalous data items. This technique is new and mostly unexplored.

Two experiments I did that use VAE reconstruction probability for anomaly detection. Results were promising but not conclusive.
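The scoring step can be sketched independently of training. In the sketch below, the encoder and decoder are crude linear stand-ins (a real VAE would use trained networks); the only part meant to be illustrative is the Monte Carlo reconstruction probability, estimated by sampling latent codes from q(z|x) and averaging the likelihood of the item under the decoder's output distribution. All weights, dimensions, and the choice of an isotropic Gaussian for p(x|z) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a trained VAE's networks (assumptions, not trained models):
# encoder(x) -> (mu, log_var) of q(z|x); decoder(z) -> mean of p(x|z).
W_enc = rng.normal(0, 0.3, (6, 2))
W_dec = rng.normal(0, 0.3, (2, 6))

def encoder(x):
    return x @ W_enc, np.full(2, -2.0)   # mean and log-variance of q(z|x)

def decoder(z):
    return z @ W_dec                     # mean of p(x|z)

def reconstruction_log_prob(x, n_samples=50, sigma=0.1):
    # Monte Carlo estimate of log E_{z ~ q(z|x)} p(x|z).
    mu, log_var = encoder(x)
    std = np.exp(0.5 * log_var)
    log_ps = []
    for _ in range(n_samples):
        z = mu + std * rng.normal(size=mu.shape)   # sample z ~ q(z|x)
        x_hat = decoder(z)
        # log-density of an isotropic Gaussian p(x|z) centered at x_hat
        log_ps.append(-0.5 * np.sum(((x - x_hat) / sigma) ** 2
                                    + np.log(2 * np.pi * sigma ** 2)))
    m = max(log_ps)   # log-sum-exp trick for numerical stability
    return m + np.log(np.mean(np.exp(np.array(log_ps) - m)))

x = rng.normal(0.5, 0.1, size=6)     # one encoded/normalized data item
score = reconstruction_log_prob(x)   # low values flag anomalies
```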

4. Generative adversarial network (GAN) duplication error. A GAN is designed to generate synthetic data. You train a GAN. Normally, you feed random values to the GAN and it generates a synthetic data item which you use in some way. But for GAN anomaly detection, after training the GAN, you feed each source data item to the GAN and then compare the generated synthetic item with the source item. Large differences between a source item and its duplicate indicate an anomalous item. This technique currently exists only in my head — to the best of my knowledge it’s never been tried, perhaps in part because GANs are very difficult to work with.

5. Reinforcement Q-learning. In Q-learning, you have a set of states (such as positions in a maze) and a set of actions (such as move left), and you use something called the Bellman equation to find the quality (Q) of each possible action for each possible state. For anomaly detection, you use each data item as a state and the possible actions are moving to any other data item. The Q values are computed based on how different the current item is from the next item. This technique currently exists only in my head.
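Since the technique exists only in the author's head, the sketch below is just one possible reading of it: states and actions both index data items, the reward for jumping from one item to another is how different the two items are, and a tabular Bellman update fills in the Q values. The 5 synthetic items, the distance-based reward, and all the hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

items = rng.normal(size=(5, 3))   # 5 made-up data items, 3 features each
n = len(items)
Q = np.zeros((n, n))              # Q[s, a]: quality of moving from item s to item a

alpha, gamma = 0.1, 0.9           # learning rate and discount factor
for _ in range(1000):
    s = rng.integers(n)                        # current item (state)
    a = rng.integers(n)                        # action: jump to item a
    r = np.linalg.norm(items[s] - items[a])    # reward: how different the items are
    # Bellman update: move Q[s, a] toward r + gamma * max over a' of Q[a, a']
    Q[s, a] += alpha * (r + gamma * np.max(Q[a]) - Q[s, a])
```

How the learned Q values would map to a per-item anomaly score is left open in the original description, so it is left open here too.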

6. LSTM Architecture for time series data. An LSTM is a deep neural system used with sequential (time series) data. A sequence of items is fed one at a time to an LSTM which then predicts the next item in the series. After you train an LSTM you compare each item with the prediction generated by the input of the previous items. Data items where the prediction is far off are anomalous. LSTM anomaly detection tends to either work quite well or fail badly.

An example of an LSTM. The difference between the blue line (actual data) and the yellow line (predicted) is an indicator of anomalous data. Results were disappointing but not conclusive.
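The compare-actual-versus-predicted step looks something like the PyTorch sketch below. Training is deliberately omitted, so the LSTM weights here are untrained; the toy series (a sine wave with one planted spike) is made up for illustration.

```python
import torch

torch.manual_seed(0)

# A toy next-step predictor. Training is omitted here, so the LSTM
# weights are untrained; the point is the scoring step after training.
lstm = torch.nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
head = torch.nn.Linear(16, 1)

series = torch.sin(torch.linspace(0, 12, 120)).reshape(1, -1, 1)
series[0, 80, 0] = 3.0   # planted spike at position 80

with torch.no_grad():
    out, _ = lstm(series[:, :-1, :])   # hidden state after each step
    preds = head(out)                  # predicted next value at each step
    errors = (preds - series[:, 1:, :]).abs().squeeze()

# Items where the prediction is far off (large errors) are anomalous;
# errors[79] corresponds to the planted spike at position 80.
```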

7. Transformer Architecture (TA) for time series data. A Transformer Architecture system is similar to an LSTM but instead of processing one data item at a time, a TA accepts a collection of data items and works on them all at the same time. TAs are extremely difficult to work with. The technique currently exists only in my head.
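Again the anomaly detection technique is untried, but the all-at-once processing that distinguishes a TA from an LSTM is easy to see with the standard PyTorch transformer encoder. The sequence length, feature size, and layer sizes below are arbitrary assumptions:

```python
import torch

torch.manual_seed(0)

# Unlike an LSTM, which consumes one item per time step, a transformer
# encoder attends over the whole sequence of items at once.
layer = torch.nn.TransformerEncoderLayer(d_model=8, nhead=2, batch_first=True)
encoder = torch.nn.TransformerEncoder(layer, num_layers=2)

seq = torch.randn(1, 20, 8)   # a window of 20 data items, 8 features each
with torch.no_grad():
    out = encoder(seq)        # all 20 items are processed simultaneously

# out has shape (1, 20, 8): one context-enriched vector per input item.
# For anomaly detection, a prediction head and training loop would follow.
```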

A person could literally spend their entire career exploring these anomaly detection techniques, and other techniques that would emerge from the initial explorations. Very interesting, and possibly very useful, ideas.

One of the reasons I like science fiction movies is that they sometimes come up with anomalous scenes — meaning cinematography ideas that had never been seen before. There are dozens of visually ultra-creative science fiction movies that have stunning cinematography. Here are three that have numeric titles. Left: Astronaut Frank Poole (played by Gary Lockwood) in “2001: A Space Odyssey” (1968). Center: Mr. Kim’s flying Chinese restaurant boat in “The Fifth Element” (1997). Right: Officer K (played by Ryan Gosling) walks through the ruins of Las Vegas in “Blade Runner 2049” (2017).

This entry was posted in Machine Learning.
