I wrote an article titled “Autoencoder Anomaly Detection Using PyTorch” in the April 2021 edition of the online Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/04/13/autoencoder-anomaly-detection.aspx.
Anomaly detection is the process of finding items in a dataset that are different in some way from the majority of the items. For example, you could examine a dataset of credit card transactions to find anomalous items that might indicate a fraudulent transaction. The article explains how to use a PyTorch neural autoencoder to find anomalies in a dataset.
The demo program for the article used the UCI Digits dataset. It consists of a 3,823-item file named optdigits.tra (intended for training) and a 1,797-item file named optdigits.tes (for testing). Each file is a simple, comma-delimited text file. Each line represents an 8 by 8 handwritten digit from “0” to “9.”
The UCI Digits data looks like:
0,1,6,16,12, . . . 1,0,0,13,0
2,7,8,11,15, . . . 16,0,7,4,1
. . .
The first 64 values on each line are the pixel grayscale values (0 = white, to 16 = black). The last value on the line is the digit (‘0’ to ‘9’).
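To make the file layout concrete, here is a minimal sketch of reading lines in this format into 65-value vectors. The function name parse_digits, the divisors (16 for pixels, 9 for the digit label), and the synthetic sample line are assumptions made for illustration, not code or data taken from the article.

```python
import io

def parse_digits(stream):
    """Turn comma-delimited lines into 65-value vectors scaled to [0, 1].

    Assumed scaling: pixel values divided by 16, digit label divided by 9.
    """
    items = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        vals = [int(v) for v in line.split(",")]
        if len(vals) != 65:
            raise ValueError("expected 64 pixel values plus 1 digit label")
        pixels = [v / 16.0 for v in vals[:64]]  # 64 grayscale values
        label = vals[64] / 9.0                  # digit '0' to '9'
        items.append(pixels + [label])
    return items

# Synthetic line (NOT real UCI data): 64 mid-gray pixels, digit label 7.
fake_line = ",".join(["8"] * 64 + ["7"])
items = parse_digits(io.StringIO(fake_line + "\n"))
vec = items[0]
print(len(vec), vec[0], vec[64])
```

With the real files, an open file handle for optdigits.tra or optdigits.tes would be passed in place of the StringIO object.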
Left: A demo run of the autoencoder anomaly detection system. It identified a ‘7’ as the most anomalous digit. The demo program works with image data but autoencoder anomaly detection can work with any kind of data. Right: Ten examples of images from the UCI Digits dataset. Each image is only 8 by 8 so the images are quite crude.
An autoencoder is a neural network that predicts its own input. An input item x, with 65 values (the 64 pixel values plus the digit label) normalized to between 0 and 1, is fed to the autoencoder. A first neural layer transforms the 65-value tensor down to 32 values. A second layer produces a core tensor with 8 values. The core 8 values generate 32 values, which in turn generate 65 values. The sizes of the first and last layers of an autoencoder are determined by the problem data, but the number of interior hidden layers, and the number of nodes in each hidden layer, are hyperparameters that must be determined by trial and error guided by experience.
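The 65-32-8-32-65 shape described above can be sketched in PyTorch roughly as follows. The class name, the layer names, and the choice of tanh on the hidden layers with sigmoid on the output are assumptions for illustration; the text only fixes the layer sizes.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """A 65-32-8-32-65 autoencoder (activation choices are assumptions)."""

    def __init__(self):
        super().__init__()
        self.enc1 = nn.Linear(65, 32)  # input -> 32
        self.enc2 = nn.Linear(32, 8)   # 32 -> 8-value core
        self.dec1 = nn.Linear(8, 32)   # core -> 32
        self.dec2 = nn.Linear(32, 65)  # 32 -> reconstructed input

    def encode(self, x):
        z = torch.tanh(self.enc1(x))
        return torch.tanh(self.enc2(z))

    def decode(self, z):
        z = torch.tanh(self.dec1(z))
        # sigmoid keeps outputs in (0, 1), matching the normalized inputs
        return torch.sigmoid(self.dec2(z))

    def forward(self, x):
        return self.decode(self.encode(x))

torch.manual_seed(1)
net = Autoencoder()
x = torch.rand(4, 65)   # 4 random stand-in items, not real digit data
core = net.encode(x)
y = net(x)
print(core.shape, y.shape)
```

Changing the 32 and 8 to other sizes, or adding more hidden layers, is where the trial-and-error tuning of hyperparameters comes in.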
The 65 output values of the autoencoder should be very close to the 65 input values. The difference between the input and output values is called the reconstruction error. Data items with low reconstruction error are normal, and items with large reconstruction error are anomalies that should be examined.
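One way to turn reconstruction error into a ranking is sketched below. It uses per-item mean squared error, which is an assumed choice of error measure, and an untrained stand-in network with random data, so the printed index is meaningless except as a demonstration of the mechanics. The helper name reconstruction_errors is hypothetical.

```python
import torch
import torch.nn as nn

def reconstruction_errors(model, data):
    """Per-item mean squared error between each row and its reconstruction."""
    model.eval()
    with torch.no_grad():
        recon = model(data)
    return ((recon - data) ** 2).mean(dim=1)

torch.manual_seed(0)

# Untrained stand-in for a trained 65-32-8-32-65 autoencoder (assumption).
model = nn.Sequential(
    nn.Linear(65, 32), nn.Tanh(),
    nn.Linear(32, 8), nn.Tanh(),
    nn.Linear(8, 32), nn.Tanh(),
    nn.Linear(32, 65), nn.Sigmoid(),
)

data = torch.rand(10, 65)   # random stand-in items, not real digit data
errs = reconstruction_errors(model, data)

# Sort from largest error to smallest; the first index is the item
# the model reconstructs worst, so it is the top anomaly candidate.
worst = torch.argsort(errs, descending=True)
print("most anomalous item index:", worst[0].item())
```

In practice the model would be trained first so that typical items reconstruct well, and the items at the top of the sorted list would be examined by hand.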
Anomaly detection using an autoencoder is simple and often quite effective. The technique is very good at finding data items where one of the components is off in some way. I’m also looking at a new technique for anomaly detection, called variational autoencoder reconstruction probability.
The natural birth (without the aid of fertility drugs) of identical triplets is an anomalous event. The exact probability is not known but one reasonable estimate is 1 per every 1,000,000 births. An Internet image search for “interesting identical triplets” returned these three sets. Left: Erin, Missy and Mandy Maynard are from Omaha, Nebraska. They founded a successful cosmetics company. Center: Ricky, Ralston, and Reiss Gabriel are from London. They tried to use their identical DNA to commit crimes without being identified. These police arrest mugshots indicate their plan did not work very well. Right: Laura, Nicola and Alison Crimmins are from Dublin, Ireland. They are successful models.