Principal component analysis (PCA) is a well-known technique for dimensionality reduction. The idea is best explained by example. Suppose you have a set of 8×8 images where each image is a crude handwritten digit from ‘0’ to ‘9’. Each image is a point in 64 dimensions, so you can’t visualize the dataset easily. PCA can be used to reduce the dimensionality to 2 “principal components” so the dataset can be graphed (and then examined by a human to see if there are any interesting patterns).

I coded up a short demo. I used the built-in load_digits() function from the scikit-learn library. The dataset has 1797 digits. First I displayed digit [4], which is a four. Then I used PCA to reduce the dimensionality of the entire dataset to 2 so it could be graphed. I’m not all that interested in data visualizations but this was good fun.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

print("Begin PCA demo")
print("Loading 1797 8x8 digit images into memory")
digits = load_digits()

print("Displaying digit [4] which is a four")
pixels = digits.data[4]
pixels = pixels.reshape((8,8))
for i in range(8):
  for j in range(8):
    v = int(pixels[i,j])  # np.int was removed in NumPy 1.20+
    print("%.2X " % v, end="")
  print("")
# print(digits.target[4])

print("Displaying digit using pyplot")
img = np.array(digits.data[4])  # as float64
img = img.reshape((8,8))
plt.imshow(img, cmap=plt.get_cmap('gray_r'))
plt.show()

print("Using PCA(2) on entire dataset")
pca = PCA(2)  # from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
plt.scatter(projected[:, 0], projected[:, 1],
  c=digits.target, edgecolor='none', alpha=0.9,
  cmap=plt.get_cmap('nipy_spectral', 10),
  s=80)
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar()
plt.show()
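A natural follow-up question is how much information the 2-component projection retains. After fitting, scikit-learn's PCA object exposes an explained_variance_ratio_ attribute; a short check (my addition, not part of the demo above) shows the fraction of total variance each component captures:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(2).fit(digits.data)
# one fraction per component, largest first
print(pca.explained_variance_ratio_)
# total fraction of variance retained by the 2-D projection
print(pca.explained_variance_ratio_.sum())
```

The two components together keep only a modest fraction of the variance of the 64-dimensional data, which is why the scatter plot shows overlapping digit clusters rather than ten cleanly separated groups.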
