Principal Component Analysis (PCA) From Scratch Using Python

If you have some data with many features, principal component analysis (PCA) is a classical statistics technique that can be used to transform your data into a new dataset with fewer features. This is called dimensionality reduction. For example, suppose you are looking at the MNIST image dataset. Each image has 28 x 28 = 784 features/pixels. This is difficult to work with, so you might want to use PCA to reduce the dimensionality from 784 down to 2 so you could graph the data on a 2D chart.
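Here is a minimal sketch of that idea using the scikit PCA module. I'm assuming the mnist_784 dataset can be downloaded from OpenML via the fetch_openml() function; this sketch is not part of the demo below.

# sketch: reduce MNIST images from 784 pixels to 2 components for graphing
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X[:2000] / 255.0               # small subset, pixels scaled to [0, 1]
y = y[:2000].astype(int)

reduced = PCA(n_components=2).fit_transform(X)   # shape (2000, 2)
plt.scatter(reduced[:, 0], reduced[:, 1], c=y, s=4, cmap="tab10")
plt.colorbar()
plt.show()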

I coded up a short demo that does PCA in two ways: 1.) using the scikit library PCA module, and 2.) using a PCA function implemented from scratch. (I adapted the from-scratch code from a Web site about t-SNE dimensionality reduction.)

I created a dummy dataset with 10 items, where each item has 6 features/columns:

[0.3750, 0.3750, 0.6250, 0.5000, 1.0000, 0.1250]
[0.5000, 1.0000, 0.3125, 0.6875, 1.0000, 0.0000]
[0.9375, 0.7500, 0.8125, 0.1250, 1.0000, 0.7500]
[0.4375, 0.6875, 0.5000, 0.4375, 1.0000, 0.1875]
[0.6875, 0.7500, 0.4375, 0.3750, 1.0000, 0.0625]
[0.5625, 0.6250, 0.5625, 0.2500, 0.9375, 0.0000]
[0.5625, 0.8125, 0.1875, 0.5000, 0.7500, 0.0000]
[0.9375, 0.5000, 0.8125, 0.2500, 1.0000, 0.2500]
[0.7500, 0.5625, 0.8125, 0.3750, 0.8125, 0.0625]
[0.9375, 0.5000, 0.7500, 0.3125, 1.0000, 0.1875]

I reduced the data to dim=2. Using the scikit PCA module I got:

 [-0.1089 -0.3185]
 [-0.5097  0.2176]
 [ 0.625   0.3629]
 [-0.1712  0.0079]
 [-0.1368  0.0754]
 [-0.0845 -0.1206]
 [-0.4504  0.1383]
 [ 0.3954 -0.0828]
 [ 0.1349 -0.1853]
 [ 0.3062 -0.0949]

Using the custom PCA function I got:

 [-0.1089  0.3185]
 [-0.5097 -0.2176]
 [ 0.625  -0.3629]
 [-0.1712 -0.0079]
 [-0.1368 -0.0754]
 [-0.0845  0.1206]
 [-0.4504 -0.1383]
 [ 0.3954  0.0828]
 [ 0.1349  0.1853]
 [ 0.3062  0.0949]

Notice the signs of the second component are reversed. This is annoying but OK. It happens because an eigenvector is only determined up to its sign: if v is an eigenvector, then so is -v, so different implementations can legitimately return either orientation.
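A quick way to check that two sets of components agree up to sign is to compare absolute values. A tiny sketch (not part of the demo):

# the two results are the same except for the sign of each component
import numpy as np
a = np.array([-0.1089, -0.3185])   # first row from scikit
b = np.array([-0.1089,  0.3185])   # first row from the custom function
print(np.allclose(np.abs(a), np.abs(b)))   # True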

The scikit module has a built-in explained_variance_ratio_ property, which gave (0.619, 0.1927). This means the first component explains 61.9% of the variance and the second component explains an additional 19.27%, so the total variance explained by the two components is 81.17%.
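If you want the running total directly, np.cumsum() gives it. A one-liner sketch:

import numpy as np
ve = np.array([0.619, 0.1927])   # explained_variance_ratio_ values from scikit
print(np.cumsum(ve))             # [0.619  0.8117] -> 81.17% for two components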

To compute the percentage of variance explained using the custom function, I computed PCA for 6 components, the same as the number of features in the source data. Each component (column) has a variance, so I computed those variances and summed them. The percentage of variance explained by the first component is the variance of the first component divided by the sum of the variances.
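An equivalent shortcut, sketched below but not used in the demo, is to read the ratios directly from the eigenvalues of the centered scatter matrix. The variance of each component is its eigenvalue divided by (n-1), and the (n-1) factor cancels in the ratio:

# sketch: variance explained computed from eigenvalues
import numpy as np

def variance_explained(X):
  X = X - np.mean(X, axis=0)                # center each column
  evals, _ = np.linalg.eig(np.dot(X.T, X))  # eigenvalues of the scatter matrix
  evals = np.sort(evals)[::-1]              # largest eigenvalue first
  return evals / np.sum(evals)              # ratios sum to 1.0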

An alternative approach for dimensionality reduction is to use a neural autoencoder. PCA and autoencoders each have pros and cons. PCA is (mostly) deterministic but gives a somewhat weaker representation because it is limited to linear combinations of the original features. Autoencoders often give a better representation of the source data, but autoencoders are not deterministic in the sense that you can get very different results depending upon the neural architecture and training hyperparameters used.
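For comparison, here is a minimal sketch of a 6-2-6 autoencoder. I'm assuming PyTorch here; any neural library would work, and this sketch is not part of the demo below.

# sketch: 6-2-6 neural autoencoder for dimensionality reduction
import torch as T

class Autoencoder(T.nn.Module):
  def __init__(self):
    super().__init__()
    self.enc = T.nn.Linear(6, 2)   # 6 features down to 2
    self.dec = T.nn.Linear(2, 6)   # reconstruct the 6 features

  def forward(self, x):
    z = T.tanh(self.enc(x))        # 2-dim latent representation
    return T.tanh(self.dec(z))

def train_and_encode(x_data, epochs=1000, lr=0.01):
  net = Autoencoder()
  X = T.tensor(x_data, dtype=T.float32)
  opt = T.optim.Adam(net.parameters(), lr=lr)
  loss_fn = T.nn.MSELoss()
  for _ in range(epochs):
    opt.zero_grad()
    loss = loss_fn(net(X), X)      # reconstruction error
    loss.backward()
    opt.step()
  with T.no_grad():
    return T.tanh(net.enc(X)).numpy()   # reduced 2-dim representation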



One way to think about PCA is that it finds a reduced essence of data items. A book nook is a small decorative diorama, about the size of a book, that captures the essence of a concept. Left: A Harry Potter series book nook. Center: A generic book collection book nook. Right: A Hobbit / Lord of the Rings book nook.


Code below.

# pca_demo.py
import numpy as np
from sklearn import decomposition

def pca(X, n_dim):
  # center the data so that each column has mean zero
  means = np.mean(X, axis=0)
  X = X - means
  # scatter matrix (proportional to the covariance matrix)
  square_m = np.dot(X.T, X)
  (evals, evecs) = np.linalg.eig(square_m)
  # sort eigenvectors by decreasing eigenvalue; eig() does not guarantee order
  idx = np.argsort(evals)[::-1]
  evecs = evecs[:, idx]
  # project the centered data onto the first n_dim eigenvectors
  result = np.dot(X, evecs[:, 0:n_dim])
  return result

x_data = np.array([
[0.3750, 0.3750, 0.6250, 0.5000, 1.0000, 0.1250],
[0.5000, 1.0000, 0.3125, 0.6875, 1.0000, 0.0000],
[0.9375, 0.7500, 0.8125, 0.1250, 1.0000, 0.7500],
[0.4375, 0.6875, 0.5000, 0.4375, 1.0000, 0.1875],
[0.6875, 0.7500, 0.4375, 0.3750, 1.0000, 0.0625],
[0.5625, 0.6250, 0.5625, 0.2500, 0.9375, 0.0000],
[0.5625, 0.8125, 0.1875, 0.5000, 0.7500, 0.0000],
[0.9375, 0.5000, 0.8125, 0.2500, 1.0000, 0.2500],
[0.7500, 0.5625, 0.8125, 0.3750, 0.8125, 0.0625],
[0.9375, 0.5000, 0.7500, 0.3125, 1.0000, 0.1875]],
 dtype=np.float32)

print("\nBegin PCA demo ")
np.set_printoptions(precision=4, suppress=True)
print("\nSource data: ")
print(x_data)

pca_2 = decomposition.PCA(2) 
pca_2.fit(x_data)
x_reduced = pca_2.transform(x_data)
print("\nComponents from scikit PCA() function: ")
print(x_reduced)
print("\nPercent variance explained via scikit: ")
ve = pca_2.explained_variance_ratio_
print(ve)

x_reduced = pca(x_data, 2)
print("\nComponents from custom pca() function: ")
print(x_reduced)

# compute percent variance explained
x_transform = pca(x_data, 6)  # all columns
v = np.var(x_transform, axis=0, ddof=1)
sv = np.sum(v)
ve = np.array([v[0] / sv, v[1] / sv])
print("\nPercent variance explained computed: ")
print(ve)

print("\nEnd demo ")