Principal component analysis (PCA) is a classical statistics technique that can do data dimensionality reduction. This can be used to graph high dimensional data (if you reduce the dim to 2), or to clean data (by reconstructing the data from its reduced form), or anomaly detection (by comparing reconstructed data items to original source items).

When I need to do dimensionality reduction, I almost always use a deep neural autoencoder. But classical PCA can be useful in some situations such as when you’re dealing with legacy code or when you want a baseline for comparison with other techniques.

*The results on the left, from the Scikit PCA, are the same as the results from my NumPy-scratch function on the right. Principal components are allowed to vary by sign (which is a weird idea).*

Computing PCA from absolute scratch is not feasible. PCA is one of the most complicated forms of numerical programming. At its heart PCA must compute eigenvalues and eigenvectors — extremely difficult. The most common way to perform PCA when using Python is to use the PCA() class in the Scikit library decomposition module. Using the Scikit PCA is fine but you take on a big dependency and get a lot of PCA functionality you may not need.

At one level of abstraction lower, you can compute PCA using either the NumPy or SciPy libraries. I set out one weekend morning to implement PCA using only NumPy. To cut to the chase, here is what I eventually came up with:

def my_pca(x): # returns transformed x, prin components, var explained dim = len(x[0]) # n_cols means = np.mean(x, axis=0) z = x - means # avoid changing x square_m = np.dot(z.T, z) (evals, evecs) = np.linalg.eig(square_m) # 'right-hand' trans_x = np.dot(z, evecs[:,0:dim]) prin_comp = evecs.T # principal components are eigenvecs T v = np.var(trans_x, axis=0, ddof=1) # col sample var sv = np.sum(v) ve = v / sv # order everything based on variance explained ordering = np.argsort(ve)[::-1] # sort order high to low trans_x = trans_x[:,ordering] prin_comp = prin_comp[ordering,:] ve = ve[ordering] return (trans_x, prin_comp, ve)

The function accepts a NumPy 2D array. It emits the data in transformed form, a 2D array of principal components that can be used to reconstruct the source data, and an array that holds the percentage of variance explained by each component.

To test my sort-of-from-scratch PCA function, I loaded the 150-item Iris Dataset (it has dim=4) and computed a variety of PCA-related values using both the Scikit library and my own function. The results are almost the same — some of the principal components vary by sign. This is OK because principal components can vary by sign, without affecting any values that use them. (This is a strange idea but explaining would take pages, so just trust me).

The trickiest part of my custom function was making sure that it returned everything (transformed data, principal components, variances) ordered by the percentage of variance explained by each component, so that output values matched the Scikit version.

In the end, my custom function isn’t directly useful: the code is a bit tricky and the small gain of avoiding the Scikit dependency isn’t worth the effort of a custom function. But the experiment was still a big win — I learned a lot and have a rock solid understanding of how PCA works.

*There are many science fiction movies where aliens disguise themselves as humans — sort of human reconstruction. Left: “It Came From Outer Space” (1953). An spacecraft with large jellyfish-like aliens crashes in the desert. The aliens are benign and can take on the appearance of humans to collect things they need to repair their spaceship, but the aliens can’t act 100% human (reconstruction error). The situation is resolved peacefully and the aliens resume their journey. Center: “Invaders From Mars” (1953). Martians land on Earth. They are most definitely not benign. The Martians can control people by inserting a device into their necks. Again, the controlled humans act a bit oddly. The aliens are defeated in the end. Right: “Invasion of the Body Snatchers” (1956). Alien pods show up in a small California town. The pods grow into human-shapes and then take over and dispose the source human if the human falls asleep.*

Code below. Long.

# pca_scratch_iris_reconstruction.py import numpy as np import sklearn.datasets import sklearn.decomposition def my_pca(x): # returns transformed x, prin components, var explained dim = len(x[0]) # n_cols means = np.mean(x, axis=0) z = x - means # avoid changing x square_m = np.dot(z.T, z) (evals, evecs) = np.linalg.eig(square_m) # 'right-hand' trans_x = np.dot(z, evecs[:,0:dim]) # eigenvecs used to transform x prin_comp = evecs.T # principal components are just eigenvecs T v = np.var(trans_x, axis=0, ddof=1) # col sample var sv = np.sum(v) ve = v / sv # order everything based on variance explained ordering = np.argsort(ve)[::-1] # sort order high to low trans_x = trans_x[:,ordering] prin_comp = prin_comp[ordering,:] ve = ve[ordering] return (trans_x, prin_comp, ve) print("\nBegin PCA reconstruction demo ") np.set_printoptions(precision=4, suppress=True, sign=" ") print("\nLoading Iris data into memory ") X = sklearn.datasets.load_iris().data # n=150 dim=4 # print("\nLoading Wine data into memory ") # too much var in 1st compponent # X = sklearn.datasets.load_wine().data # n=178 dim=13 # print("\nLoading Diabetes data into memory ") # X = sklearn.datasets.load_diabetes().data # n=442 dim=10 # print(X); input() print("First 2 data items are: ") print(X[0]) print(X[1]) print("\nComputing columns means (needed for reconstruction) ") mu = np.mean(X, axis=0) print("Means are: ") print(mu) print("\n==== using scikit PCA ==== ") print("\nComputing principal components ") pca = sklearn.decomposition.PCA() # all features pca.fit(X) print("Done ") print("\nApplying dimensionality reduction transformation ") trans = pca.transform(X) print("First two transformed data items are: ") print(trans[0]) print(trans[1]) print("\nFetching principal components ") comps = pca.components_ print("Principal components are: ") print(comps) print("\nVariance explained by each component: ") ve = pca.explained_variance_ratio_ print(ve) dim = 4 # reconstruct source data exactly print("\nReconstructing source data using all %d components " % dim) recons = np.dot(trans[:,0:dim], comps[0:dim,:]) # (150,4) x (4,4) recons += mu print("First two reconstructed data items are: ") print(recons[0]) print(recons[1]) dim = 2 print("\nReconstructing source data using just %d components " % dim) recons = np.dot(trans[:,0:dim], comps[0:dim,:]) # (150,2) x (2,4) recons += mu print("First two reconstructed data items are: ") print(recons[0]) print(recons[1]) input() print("\n===== using scratch PCA ===== ") x_trans, p_comp, var_exp = my_pca(X) print("\nfirst two transformed: ") print(x_trans[0]);print(x_trans[1]) print("\nPrincipal components: ") print(p_comp) print("\nVar explained: ") print(var_exp) dim = 4 # reconstruct source data exactly print("\nReconstructing source data using all %d components " % dim) recons = np.dot(x_trans[:,0:dim], p_comp[0:dim,:]) # (150,4) x (4,4) recons += mu print("First two reconstructed data items are: ") print(recons[0]) print(recons[1]) dim = 2 print("\nReconstructing source data using just %d components " % dim) recons = np.dot(x_trans[:,0:dim], p_comp[0:dim,:]) # (150,2) x (2,4) recons += mu print("First two reconstructed data items are: ") print(recons[0]) print(recons[1]) print("\nEnd demo ")

After seeing the screenshot, I thought you had cracked PCA. Then to read that the implementation is almost impossible was a bit disappointing. But your screenshot also shows that you are even closer to understanding every single detail.

It’s very strange with all the other explanations about PCA that seem to fully understand the topic. But apparently more in theory than in practice. One of the great things about your blog are the little behind the scenes looks. Sometimes when such extremely difficult implementations are touted as being so simple, it can become very disappointing to the reader. Especially when you get a high level idea of the technology and then think you can just implement it by yourself from scratch if you learn enough about it. The problem pops up even more in NLP techniques.

I am working on a spiral dataset that is still waiting for the first test, hope you like the idea. Also a way to come closer to the center.