Principal Component Analysis

Principal component analysis (PCA) is a well-known technique for dimensionality reduction. The idea is best explained by example. Suppose you have a set of 8×8 images where each image is a crude handwritten digit from ‘0’ to ‘9’. This dataset has 64 dimensions, so you can’t visualize it easily. PCA can be used to reduce the dimensionality to 2 “principal components” so the dataset can be graphed (and then examined by a human to see if there are any interesting patterns).
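To make the idea concrete before the digits demo, here is a minimal sketch on a tiny made-up dataset: four items with 3 dimensions each, reduced to 2 principal components. The data values are hypothetical; the sketch assumes scikit-learn is installed.

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical data: 4 items, 3 dimensions each, nearly collinear
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.1, 6.2],
              [3.0, 5.9, 9.1],
              [4.0, 8.2, 11.8]], dtype=np.float64)

pca = PCA(n_components=2)        # reduce from 3 to 2 dimensions
reduced = pca.fit_transform(X)   # shape (4, 2)
print(reduced.shape)
print(pca.explained_variance_ratio_)  # fraction of variance each component keeps
```

Because the made-up data is nearly collinear, the first component captures almost all of the variance, which is exactly the situation where PCA works well.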

I coded up a short demo. I used the built-in load_digits() function from the scikit-learn library. The dataset has 1797 digits. First I displayed digit [4], which is a four. Then I used PCA to reduce the dimensionality of the entire dataset to 2 so it could be graphed. I’m not all that interested in data visualizations, but this was good fun.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

print("Begin PCA demo")
print("Loading 1797 8x8 digit images into memory")
digits = load_digits()

print("Displaying digit [4] which is a four")
pixels = digits.data[4]        # 64 pixel values
pixels = pixels.reshape((8,8))
for i in range(8):
  for j in range(8):
    v = int(pixels[i,j])
    print("%.2X " % v, end="")
  print("")

# print(digits.data[4])

print("Displaying digit using pyplot")
img = np.array(digits.data[4], dtype=np.float32)
img = img.reshape((8,8))
plt.imshow(img, cmap=plt.get_cmap('gray_r'))
plt.show()

print("Using PCA(2) on entire dataset")
pca = PCA(2)  # from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)

plt.scatter(projected[:, 0], projected[:, 1],
  c=digits.target, edgecolor='none', alpha=0.9,
  cmap=plt.get_cmap('nipy_spectral', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.show()
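One natural follow-up question is how much information the 2 components actually retain out of the original 64 dimensions. A sketch of that check, assuming scikit-learn is installed, using the fitted PCA object's explained_variance_ratio_ attribute:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(2)  # from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)

print(projected.shape)  # (1797, 2)
# fraction of the total variance retained by the 2 components
print(pca.explained_variance_ratio_.sum())
```

The retained fraction is well under half, which is a reminder that the 2-D scatter plot is a rough summary of the data, not a faithful reconstruction.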
This entry was posted in Machine Learning.