I was working on an anomaly detection system recently. The system used a deep neural autoencoder. As part of the system evaluation, we looked at anomaly detection using principal component analysis (PCA). PCA is a classical statistics technique that decomposes source data into a set of orthogonal principal components, ordered by how much of the data's variance each one explains. If you keep only the first few components, you are performing a kind of dimensionality reduction.
You can use the decomposition to reconstruct a prediction of the original source data. If you compare the reconstructed data with the original source data and compute error as the difference between source and reconstruction, items with large reconstruction error are anomalous in some way.
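To make the idea concrete, here is a minimal sketch of reconstruction error using only NumPy. It is not the demo code from this post: the function name pca_reconstruction_error and the tiny six-row data set are made up for illustration, and the principal directions are computed with an SVD of the centered data instead of a PCA library class.

```python
import numpy as np

def pca_reconstruction_error(X, n_keep):
    """Per-row squared reconstruction error using the first n_keep components."""
    mu = X.mean(axis=0)
    Xc = X - mu                                   # center each column
    # rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_keep]                               # shape (n_keep, n_features)
    recon = (Xc @ V.T) @ V + mu                   # project down, then map back
    return np.sum((X - recon) ** 2, axis=1)

# hypothetical toy data: five points near the line y = 2x, plus one point off it
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9],
              [4.0, 8.0], [5.0, 10.1], [3.0, 9.0]])

errs = pca_reconstruction_error(X, n_keep=1)
print(errs)
print("most anomalous row:", int(np.argmax(errs)))
```

With one component kept, the last row (index 5) sits farthest from the main direction of the data, so it gets the largest reconstruction error.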
I coded up a partial demo of anomaly detection using PCA reconstruction error. I used several resources I found on the Internet, especially a response to a question at stats.stackexchange.com/questions/229092/how-to-reverse-pca-and-reconstruct-original-variables-from-several-principal-com.
The source data is the well-known Fisher Iris Dataset. It consists of 150 items. Each data item has four values: sepal length and width, and petal length and width.
The demo loads the 150-item Iris source data into memory, then computes PCA components and uses that information to transform the source data. Because each data item has four values, there are four principal components. It is possible to completely reconstruct the source data by using all four principal components, but this isn’t useful for anomaly detection because every item would have zero error. The demo code uses just 2 of the 4 components to reconstruct the source data.
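The difference between using all four components and using only two can be checked directly. The sketch below assumes scikit-learn is installed. The helper function reconstruct is a name I made up for illustration, and it uses the fitted object's mean_ attribute in place of a separately computed column mean.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # shape (150, 4)
pca = PCA(n_components=4).fit(X)
trans = pca.transform(X)                  # centered data projected onto the components

def reconstruct(k):
    # hypothetical helper: keep the first k components, then add the column means back
    return trans[:, :k] @ pca.components_[:k, :] + pca.mean_

full = reconstruct(4)
partial = reconstruct(2)

print("max abs error, 4 components:", np.max(np.abs(X - full)))
print("max abs error, 2 components:", np.max(np.abs(X - partial)))
```

The four-component error should be zero up to floating point rounding, while the two-component error is small but clearly nonzero. That leftover error is the signal used for anomaly detection.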
The first two data source items and their reconstructions are:
source items:
[ 5.1 3.5 1.4 0.2]
[ 4.9 3.0 1.4 0.2]

reconstructions:
[ 5.0830 3.5174 1.4032 0.2135]
[ 4.7463 3.1575 1.4636 0.2402]
The reconstructions are very close to the source items. This makes sense because the Iris data is very simple and there aren’t any major anomalies. To complete an anomaly detection system, I’d compute squared error between each source item and its reconstruction, then sort by error from large to small.
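One possible way to finish the system is sketched here, assuming scikit-learn is installed. It is not part of the original demo. It fits PCA with two components and calls inverse_transform as a shortcut for the manual dot product plus column means, and the cutoff top_n = 5 is an arbitrary value picked for display.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# keep 2 of the 4 components; inverse_transform maps back to the original 4 columns
pca = PCA(n_components=2).fit(X)
recons = pca.inverse_transform(pca.transform(X))

# squared reconstruction error, one value per data item
sq_err = np.sum((X - recons) ** 2, axis=1)

# item indices sorted from largest error to smallest
order = np.argsort(sq_err)[::-1]

top_n = 5  # arbitrary number of items to show (an assumption)
for idx in order[:top_n]:
    print("item %3d  error = %.4f  source = %s" % (idx, sq_err[idx], X[idx]))
```

The items at the top of the list are the candidates to examine by hand. What counts as "large" error is a judgment call that depends on the data.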
Based on my experience, anomaly detection based on neural autoencoder reconstruction error works better than detection based on PCA reconstruction error. But even in the worst-case scenario, PCA reconstruction error anomaly detection can serve as a baseline for evaluating other techniques.
Forensic reconstruction takes human remains and uses a variety of techniques to create a lifelike image that’s often very close to what the deceased person truly looked like. Left: An Egyptian girl known as “Meritamun” was about 20 years old when she died approximately 2,000 years ago. Her light complexion indicates she was a member of the upper class rather than a dark-skinned slave. Center: A woman from Britain. She lived approximately 5,000 years ago during the neolithic era and was about 22 years old when she died. Right: “Jane” was about 15 years old when she died during a famine in 1609 in Jamestown, Virginia, the first English settlement in the Americas.
# pca_iris_reconstruction.py

import numpy as np
import sklearn.datasets
import sklearn.decomposition

print("\nBegin PCA reconstruction demo ")
np.set_printoptions(precision=4, suppress=True, sign=" ")

print("\nLoading Iris data into memory ")
X = sklearn.datasets.load_iris().data
print("First 2 data items are: ")
print(X[0])
print(X[1])

print("\nComputing column means (needed for reconstruction) ")
mu = np.mean(X, axis=0)
print("Means are: ")
print(mu)

print("\nComputing principal components ")
pca = sklearn.decomposition.PCA(n_components=4)
pca.fit(X)
print("Done ")

print("\nApplying dimensionality reduction transformation ")
# transform() subtracts the column means before projecting
trans = pca.transform(X)
print("First two transformed data items are: ")
print(trans[0])
print(trans[1])

print("\nFetching principal components ")
comps = pca.components_  # one component per row
print("Principal components are: ")
print(comps)

print("\nVariance explained by each component: ")
ve = pca.explained_variance_ratio_
print(ve)

dim = 2
print("\nReconstructing source data using %d components " % dim)
# use only the first dim components, then add the means back
recons = np.dot(trans[:,0:dim], comps[0:dim,:])
recons += mu
print("First two reconstructed data items are: ")
print(recons[0])
print(recons[1])

print("\nEnd demo ")