Principal Component Analysis

Principal component analysis (PCA) is a well-known technique for dimensionality reduction. The idea is best explained by example. Suppose you have a set of 8×8 images where each image is a crude handwritten digit from ‘0’ to ‘9’. Each image has 64 pixel values, so the dataset has 64 dimensions and you can’t visualize it easily. PCA can be used to reduce the dimensionality to 2 “principal components” so the dataset can be graphed (and then examined by a human to see if there are any interesting patterns).
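
To make the idea a bit more concrete, here is a minimal sketch of one common way to compute principal components: center the data, then take a singular value decomposition. This is an illustration only, not necessarily what scikit-learn does internally in every case. The function name pca_project and the random stand-in data are assumptions made up for the example.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top n_components principal directions.

    Hypothetical helper for illustration; not part of any library.
    """
    # PCA works on mean-centered data
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Coordinates of each row of X in the reduced space
    return X_centered @ components.T, components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))   # stand-in for 100 flattened 8x8 images
projected, components = pca_project(X, 2)
print(projected.shape)           # (100, 2)
```

Each of the 100 items goes from 64 values to 2 values, which is what makes a scatter plot possible.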

I coded up a short demo. I used the built-in load_digits() function from the scikit-learn library. The dataset has 1797 digits. First I displayed the digit at index [4], which is a four, both as hexadecimal pixel values and as an image. Then I used PCA to reduce the dimensionality of the entire dataset to 2 so it could be graphed. I’m not all that interested in data visualizations but this was good fun.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

print("Begin PCA demo")
print("Loading 1797 8x8 digit images into memory")
digits = load_digits()

print("Displaying digit 04 which is a four")
pixels = digits.data[4]
pixels = pixels.reshape((8,8))
for i in range(8):
  for j in range(8):
    v = int(pixels[i,j])  # np.int was removed in NumPy 1.24
    print("%.2X " % v, end="")
    #print(" ", end="")
  print("")

# print(digits.target[4])

print("Displaying digit using pyplot")
img = np.array(digits.data[4])   # copy of the 64 pixel values (float64)
img = img.reshape((8,8))
plt.imshow(img, cmap=plt.get_cmap('gray_r'))
plt.show()  

print("Using PCA(2) on entire dataset")
pca = PCA(2)  # from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)

plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none', alpha=0.9,
            cmap=plt.get_cmap('nipy_spectral', 10),
            s=80)
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar()
plt.show()
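
A natural follow-up question is how much information survives the reduction from 64 dimensions to 2. The sketch below, which assumes the same load_digits() data and PCA(2) setup as the demo, reads the explained_variance_ratio_ attribute of the fitted PCA object. The variable name ratios is just for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(n_components=2)   # same reduction as the demo: 64 to 2 dimensions
projected = pca.fit_transform(digits.data)

# Fraction of the dataset's total variance captured by each component
ratios = pca.explained_variance_ratio_
print("variance retained per component:", ratios)
print("total variance retained: %.3f" % ratios.sum())
```

If the total is well below 1.0, the 2D scatter plot is a lossy view of the data, so clusters that overlap in the graph may still be separable in the original 64 dimensions.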