Computing PCA Using NumPy Without Scikit

Principal component analysis (PCA) is a classical statistics technique for data dimensionality reduction. The reduced form can be used to graph high-dimensional data (if you reduce the dimensionality to 2), to clean data (by reconstructing the data from its reduced form), or to perform anomaly detection (by comparing reconstructed data items to the original source items).
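
As a concrete illustration of the anomaly-detection idea, here is a minimal sketch (not from my experiment below) that scores each item by its reconstruction error after reducing to 2 components. The Iris data and the choice of 2 components are just assumptions for the example:

import numpy as np
import sklearn.datasets
import sklearn.decomposition

X = sklearn.datasets.load_iris().data        # 150 items, dim=4
pca = sklearn.decomposition.PCA(n_components=2)
reduced = pca.fit_transform(X)               # reduce dim from 4 to 2
recons = pca.inverse_transform(reduced)      # reconstruct back to dim=4
errors = np.linalg.norm(X - recons, axis=1)  # per-item reconstruction error
print(np.argsort(errors)[-3:])               # indices of the 3 most anomalous items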

When I need to do dimensionality reduction, I almost always use a deep neural autoencoder. But classical PCA can be useful in some situations such as when you’re dealing with legacy code or when you want a baseline for comparison with other techniques.


The results on the left, from the Scikit PCA, are the same as the results from my NumPy-scratch function on the right. Principal components are allowed to vary by sign (which is a weird idea).

Computing PCA from absolute scratch is not feasible. PCA is one of the most complicated forms of numerical programming because at its heart it must compute eigenvalues and eigenvectors, which is extremely difficult. The most common way to perform PCA in Python is to use the PCA() class in the Scikit library decomposition module. Using the Scikit PCA is fine, but you take on a big dependency and get a lot of PCA functionality you may not need.
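
For reference, a minimal sketch of the Scikit route (the full demo program at the end of this post uses the same calls):

import sklearn.datasets
import sklearn.decomposition

X = sklearn.datasets.load_iris().data  # any 2D numeric array works
pca = sklearn.decomposition.PCA()      # keep all components
pca.fit(X)
trans = pca.transform(X)               # transformed (rotated) data
comps = pca.components_                # principal components, one per row
ve = pca.explained_variance_ratio_     # proportion of variance explained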

At one level of abstraction lower, you can compute PCA using either the NumPy or SciPy libraries. I set out one weekend morning to implement PCA using only NumPy. To cut to the chase, here is what I eventually came up with:

def my_pca(x):
  # returns transformed x, prin components, var explained
  dim = len(x[0])  # n_cols
  means = np.mean(x, axis=0)
  z = x - means  # avoid changing x
  square_m = np.dot(z.T, z)
  (evals, evecs) = np.linalg.eig(square_m)  # right eigenvectors
  trans_x = np.dot(z, evecs[:,0:dim])
  prin_comp = evecs.T  # principal components are eigenvecs T
  v = np.var(trans_x, axis=0, ddof=1)  # col sample var
  sv = np.sum(v)
  ve = v / sv
  # order everything based on variance explained
  ordering = np.argsort(ve)[::-1]  # sort order high to low
  trans_x = trans_x[:,ordering]
  prin_comp = prin_comp[ordering,:]
  ve = ve[ordering]
  return (trans_x, prin_comp, ve)
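
As an aside (not part of my function above), because the scatter matrix z.T * z is symmetric, np.linalg.eigh() could be used in place of np.linalg.eig(). The eigh() function guarantees real-valued results and returns eigenvalues in ascending order, so a descending reorder is still needed. A sketch of that drop-in variation for the eig() line:

evals, evecs = np.linalg.eigh(square_m)  # symmetric case: real results, ascending eigenvalues
evals = evals[::-1]                      # flip to descending (largest variance first)
evecs = evecs[:, ::-1]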

The function accepts a NumPy 2D array. It returns the data in transformed form, a 2D array of principal components (one per row) that can be used to reconstruct the source data, and an array that holds the proportion of variance explained by each component.
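
A minimal sketch of calling the function; the shapes in the comments assume X is the 150-by-4 Iris data loaded in the full program below:

trans_x, prin_comp, ve = my_pca(X)  # X is a NumPy 2D array
print(trans_x.shape)                # (150,4) -- transformed data
print(prin_comp.shape)              # (4,4)   -- one principal component per row
print(ve)                           # 4 values that sum to 1.0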

To test my sort-of-from-scratch PCA function, I loaded the 150-item Iris Dataset (it has dim=4) and computed a variety of PCA-related values using both the Scikit library and my own function. The results are almost the same; some of the principal components differ by sign. This is OK because principal components can vary by sign without affecting any values that use them. (This is a strange idea, but explaining it would take pages, so just trust me.)
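
One way to check that the two sets of components agree except for sign is to compare their absolute values. A minimal sketch, using the pca object and my_pca() function from the full program below:

sk_comps = pca.components_    # from the Scikit PCA object
(_, my_comps, _) = my_pca(X)  # from the scratch function
print(np.allclose(np.abs(sk_comps), np.abs(my_comps), atol=1.0e-6))  # True if equal up to sign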

The trickiest part of my custom function was making sure that it returned everything (transformed data, principal components, variances) ordered by the proportion of variance explained by each component, so that the output values matched the Scikit version.
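
The ordering relies on the fact that np.argsort() sorts ascending, so the result is reversed to get a high-to-low order. A tiny illustration with made-up variance-explained values:

ve = np.array([0.10, 0.72, 0.03, 0.15])  # made-up variance-explained values
ordering = np.argsort(ve)[::-1]          # [1 3 0 2] -- high to low
print(ve[ordering])                      # [0.72 0.15 0.10 0.03]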

In the end, my custom function isn't directly useful: the code is a bit tricky, and the small gain of avoiding the Scikit dependency isn't worth the effort. But the experiment was still a big win. I learned a lot and now have a rock-solid understanding of how PCA works.



There are many science fiction movies where aliens disguise themselves as humans, a sort of human reconstruction. Left: “It Came From Outer Space” (1953). A spacecraft carrying large jellyfish-like aliens crashes in the desert. The aliens are benign and can take on the appearance of humans to collect things they need to repair their spaceship, but the aliens can’t act 100% human (reconstruction error). The situation is resolved peacefully and the aliens resume their journey. Center: “Invaders From Mars” (1953). Martians land on Earth. They are most definitely not benign. The Martians can control people by inserting a device into their necks. Again, the controlled humans act a bit oddly. The aliens are defeated in the end. Right: “Invasion of the Body Snatchers” (1956). Alien pods show up in a small California town. The pods grow into human shapes and then take over and dispose of the source human if the human falls asleep.


Code below. Long.

# pca_scratch_iris_reconstruction.py

import numpy as np
import sklearn.datasets
import sklearn.decomposition

def my_pca(x):
  # returns transformed x, prin components, var explained
  dim = len(x[0])  # n_cols
  means = np.mean(x, axis=0)
  z = x - means  # avoid changing x
  square_m = np.dot(z.T, z)
  (evals, evecs) = np.linalg.eig(square_m)  # right eigenvectors
  trans_x = np.dot(z, evecs[:,0:dim])  # eigenvecs used to transform x
  prin_comp = evecs.T  # principal components are just eigenvecs T
  v = np.var(trans_x, axis=0, ddof=1)  # col sample var
  sv = np.sum(v)
  ve = v / sv
  # order everything based on variance explained
  ordering = np.argsort(ve)[::-1]  # sort order high to low
  trans_x = trans_x[:,ordering]
  prin_comp = prin_comp[ordering,:]
  ve = ve[ordering]

  return (trans_x, prin_comp, ve)

print("\nBegin PCA reconstruction demo ")
np.set_printoptions(precision=4, suppress=True, sign=" ")

print("\nLoading Iris data into memory ")
X = sklearn.datasets.load_iris().data  # n=150 dim=4

# print("\nLoading Wine data into memory ")  # too much var in 1st compponent
# X = sklearn.datasets.load_wine().data  # n=178 dim=13

# print("\nLoading Diabetes data into memory ")
# X = sklearn.datasets.load_diabetes().data  # n=442 dim=10
# print(X); input()

print("First 2 data items are: ")
print(X[0])
print(X[1])

print("\nComputing columns means (needed for reconstruction) ")
mu = np.mean(X, axis=0)
print("Means are: ")
print(mu)

print("\n==== using scikit PCA ==== ")

print("\nComputing principal components ")
pca = sklearn.decomposition.PCA()  # all features
pca.fit(X)
print("Done ")

print("\nApplying dimensionality reduction transformation ")
trans = pca.transform(X)
print("First two transformed data items are: ")
print(trans[0])
print(trans[1])

print("\nFetching principal components ")
comps = pca.components_
print("Principal components are: ")
print(comps)

print("\nVariance explained by each component: ")
ve = pca.explained_variance_ratio_
print(ve)

dim = 4  # reconstruct source data exactly
print("\nReconstructing source data using all %d components " % dim)
recons = np.dot(trans[:,0:dim], comps[0:dim,:])  # (150,4) x (4,4)
recons += mu
print("First two reconstructed data items are: ")
print(recons[0])
print(recons[1])

dim = 2
print("\nReconstructing source data using just %d components " % dim)
recons = np.dot(trans[:,0:dim], comps[0:dim,:])  # (150,2) x (2,4)
recons += mu
print("First two reconstructed data items are: ")
print(recons[0])
print(recons[1])

input()  # pause before the scratch PCA section

print("\n===== using scratch PCA ===== ")

x_trans, p_comp, var_exp = my_pca(X)
print("\nfirst two transformed: ")
print(x_trans[0]);print(x_trans[1])
print("\nPrincipal components: ")
print(p_comp)
print("\nVar explained: ")
print(var_exp)

dim = 4  # reconstruct source data exactly
print("\nReconstructing source data using all %d components " % dim)
recons = np.dot(x_trans[:,0:dim], p_comp[0:dim,:])  # (150,4) x (4,4)
recons += mu
print("First two reconstructed data items are: ")
print(recons[0])
print(recons[1])

dim = 2
print("\nReconstructing source data using just %d components " % dim)
recons = np.dot(x_trans[:,0:dim], p_comp[0:dim,:])  # (150,2) x (2,4)
recons += mu
print("First two reconstructed data items are: ")
print(recons[0])
print(recons[1])

print("\nEnd demo ")

1 Response to Computing PCA Using NumPy Without Scikit

  1. Thorsten Kleppe says:

    After seeing the screenshot, I thought you had cracked PCA. Then to read that the implementation is almost impossible was a bit disappointing. But your screenshot also shows that you are even closer to understanding every single detail.

    It’s very strange with all the other explanations about PCA that seem to fully understand the topic. But apparently more in theory than in practice. One of the great things about your blog are the little behind the scenes looks. Sometimes when such extremely difficult implementations are touted as being so simple, it can become very disappointing to the reader. Especially when you get a high level idea of the technology and then think you can just implement it by yourself from scratch if you learn enough about it. The problem pops up even more in NLP techniques.

    I am working on a spiral dataset that is still waiting for the first test, hope you like the idea. Also a way to come closer to the center.
