Principal component analysis (PCA) is a classical statistics technique for dimensionality reduction. For example, the Penguin Dataset has 333 valid (no missing values) items. Each item has four numeric predictor variables (bill length, bill width, flipper length, body weight) and a species to predict (0 = Adelie, 1 = Chinstrap, 2 = Gentoo). In some scenarios, instead of dealing with all four predictors, you might want to reduce the data to just two predictors. Of course, this is just an example. A more realistic example would be a very large dataset with many predictor variables.
The classical technique for PCA looks like:
1. load data 2. normalize data 3. compute covariance matrix from normalized data 4. compute eigenvalues and eigenvectors directly from covariance matrix 5. sort eigenvalues from large to small 6. determine how many eigenvalues to use 7. extract associated eigenvectors 8. use eigenvectors to reduce data
In the classical technique, in step #4, eigenvalues and eigenvectors are computed directly from the covariance matrix, typically by using the QR decomposition algorithm, or by using the Jacobi algorithm.
I noticed that the scikit library PCA module documentation states that singular value decomposition (SVD) is used instead of computing eigenvalues and eigenvectors directly.
If you have a matrix A and apply SVD, you get a matrix U, a vector s of “singular values”, and a matrix Vh so that A = U * S * Vh where S is a matrix with the s vector values on the diagonal. To use SVD for PCA, you apply SVD directly to the normalized source data. The eigenvectors are stored in U. To compute the eigenvectors from the singular values, the computation is eigval(i) = s(i)^2 / (n-1) where n is the number of rows in the source data.
Put another way, step #4 in PCA becomes:
4. compute eigenvalues and eigenvectors indirectly from SVD applied to normalized data (using s and U results)
I had examples of classical PCA from scratch using C# (with the QR algorithm, and with the Jacobi algorithm, for computing eigenvalues and eigenvectors) but I decided to put together an example of PCA from scratch using C# with the SVD approach. I wanted to verify my knowledge.
The classical code for step #4 look like:
double[] eigenVals; double[][] eigenVecs; Eigen(covarMat, out eigenVals, out eigenVecs); // QR or Jacobi
The SVD version for step #4 looks like:
double[][] U; double[][] Vh; double[] s; int n = stdX.Length; int dim = 4; SVD_Jacobi(MatTranspose(stdX), out U, out Vh, out s); // compute eigenvalues from singular values double[] eigenVals = new double[dim]; for (int i = 0; i "lt"; dim; ++i) eigenVals[i] = (s[i] * s[i]) / (n - 1); // eigenvectors are in U double[][] eigenVecs = MatExtract(U, dim, dim);
I ran my from-scratch C# PCA using SVD version and a scikit PCA demo on a 9-item subset of the Penguin Dataset. Both programs gave identical results (except for the signs of eigenvectors, which is OK). The source data is:
# penguin_9.txt # species: 0 = Adelie, 1 = Chinstrap, 2 = Gentoo # bill length (mm), bill depth (mm), # flipper length (mm), body mass (g) # 0, 39.1, 18.7, 181.0, 3750.0 0, 39.5, 17.4, 186.0, 3800.0 0, 40.3, 18.0, 195.0, 3250.0 1, 46.5, 17.9, 192.0, 3500.0 1, 50.0, 19.5, 196.0, 3900.0 1, 51.3, 19.2, 193.0, 3650.0 2, 46.1, 13.2, 211.0, 4500.0 2, 50.0, 16.3, 230.0, 5700.0 2, 48.7, 14.1, 210.0, 4450.0
The SVD version of PCA is often preferred to the classical direct eigenvalues and eigenvectors version. A few Internet resources say something along the lines of, “SVD is more efficient than directly computing eigenvectors” or “Computing SVD singular values is more stable than computing eigenvalues directly”, but I haven’t seen any hard research evidence that demonstrates this. (There probably is such evidence but I just haven’t seen it).
Good fun.
Three images from an Internet search for “minimalist portrait”, ordered by increasing dimensionality reduction. The middle portrait is by famous artist Ichiro Tsuruta. The other two are anonymous.
Demo code for the PCA using SVD. Replace “lt”, “gt”, “lte”, “gte”, “and” with Boolean operator symbols.
using System; using System.IO; namespace PrincipalComponentSVD { internal class PrincipalComponentProgram { static void Main(string[] args) { Console.WriteLine("\nBegin PCA from scratch C#" + " using SVD technique "); // 0. load data Console.WriteLine("\nLoading Penguin-9 data "); string dataFile = "..\\..\\..\\Data\\penguin_9.txt"; double[][] X = MatLoad(dataFile, new int[] { 1, 2, 3, 4 }, ',', "#"); Console.WriteLine("Done "); Console.WriteLine("\nSource data: "); MatShow(X, 1, 9, true); // 1. standardize data Console.WriteLine("\nStandardizing (z-score," + " biased) "); double[] means; double[] stds; double[][] stdX = MatStandardize(X, out means, out stds); Console.WriteLine("\nStandardized data items "); MatShow(stdX, 4, 9, nRows: 3); Console.WriteLine("\nComputing " + " eigenvalues and eigenvectors using SVD: "); // 2. eigenvalues eigenvectors from standardized data // SVD approach // eigenvectors are in U, singular vals are in s double[][] U; double[][] Vh; double[] s; int n = stdX.Length; int dim = stdX[0].Length; SVD_Jacobi(MatTranspose(stdX), out U, out Vh, out s); // compute eigenvalues from singular values double[] eigenVals = new double[dim]; for (int i = 0; i "lt" dim; ++i) eigenVals[i] = (s[i] * s[i]) / (n - 1); // eigenvectors are in U double[][] eigenVecs = MatExtract(U, dim, dim); // double[][] eigenVecs = U; // works too // double[][] eigenVecs = MatCopy(U); // works too Console.WriteLine("\nEigenvectors (not sorted, " + "as cols): "); MatShow(eigenVecs, 4, 9, false); Console.WriteLine("\nEigenvalues, not sorted: "); VecShow(eigenVals, 4, 9); // 3. sort eigenvals from large to smallest int[] idxs = ArgSort(eigenVals); // to sort evecs Array.Reverse(idxs); Array.Sort(eigenVals); Array.Reverse(eigenVals); Console.WriteLine("\nEigenvalues (sorted): "); VecShow(eigenVals, 4, 9); // 4. sort eigenvectors eigenVecs = MatExtractCols(eigenVecs, idxs); // sort eigenVecs = MatTranspose(eigenVecs); // scikit format Console.WriteLine("\nEigenvectors (sorted, rows): "); MatShow(eigenVecs, 4, 9, false); // 5. variance explained to guide number components Console.WriteLine("\nComputing variance" + " explained: "); double[] varExplained = VarExplained(eigenVals); VecShow(varExplained, 4, 9); // 6. transform all data Console.WriteLine("\nComputing transformed" + " data (all components): "); double[][] transformed = MatProduct(stdX, MatTranspose(eigenVecs)); // all MatShow(transformed, 4, 9, true); // 7. reconstruct data just for fun / to verify Console.WriteLine("\nReconstructing " + " original data: "); double[][] reconstructed = MatProduct(transformed, eigenVecs); for (int i = 0; i "lt" reconstructed.Length; ++i) for (int j = 0; j "lt" reconstructed[0].Length; ++j) reconstructed[i][j] = (reconstructed[i][j] * stds[j]) + means[j]; MatShow(reconstructed, 1, 9, 3); Console.WriteLine("\nEnd demo "); Console.ReadLine(); }// Main // ====================================================== static void SVD_Jacobi(double[][] mat, out double[][] U, out double[][] Vh, out double[] s) { // works for m != n double EPSILON = 1.0e-15; double[][] A = MatCopy(mat); // working U int m = A.Length; int n = A[0].Length; // check m == n double[][] Q = MatIdentity(n); // working V double[] t = new double[n]; // working s int ct = 1; // rotation counter int pass = 0; double tol = 10 * n * EPSILON; // heuristic int passMax = 5 * n; if (passMax "lt" 15) passMax = 15; // heuristic // save the column error estimates for (int j = 0; j "lt" n; ++j) { double[] cj = MatGetColumn(A, j); double sj = VecNorm(cj); t[j] = EPSILON * sj; } while (ct "gt" 0 "and" pass "lte" passMax) { ct = n * (n - 1) / 2; // rotation counter for (int j = 0; j "lt" n - 1; ++j) { for (int k = j + 1; k "lt" n; ++k) { double sin; double cos; double[] cj = MatGetColumn(A, j); double[] ck = MatGetColumn(A, k); double p = 2.0 * VecDot(cj, ck); double a = VecNorm(cj); double b = VecNorm(ck); double q = a * a - b * b; double v = Hypot(p, q); double errorA = t[j]; double errorB = t[k]; bool sorted = false; if (a "gte" b) sorted = true; bool orthog = false; if (Math.Abs(p) "lte" tol * (a * b)) orthog = true; bool badA = false; if (a "lt" errorA) badA = true; bool badB = false; if (b "lt" errorB) badB = true; if (sorted == true "and" (orthog == true || badA == true || badB == true)) { --ct; continue; } // compute rotation angles if (v == 0 || sorted == false) { cos = 0.0; sin = 1.0; } else { cos = Math.Sqrt((v + q) / (2.0 * v)); sin = p / (2.0 * v * cos); } // apply rotation to A (U) for (int i = 0; i "lt" m; ++i) { double Aik = A[i][k]; double Aij = A[i][j]; A[i][j] = Aij * cos + Aik * sin; A[i][k] = -Aij * sin + Aik * cos; } // update singular values t[j] = Math.Abs(cos) * errorA + Math.Abs(sin) * errorB; t[k] = Math.Abs(sin) * errorA + Math.Abs(cos) * errorB; // apply rotation to Q (V) for (int i = 0; i "lt" n; ++i) { double Qij = Q[i][j]; double Qik = Q[i][k]; Q[i][j] = Qij * cos + Qik * sin; Q[i][k] = -Qij * sin + Qik * cos; } // i } // k } // j ++pass; } // while // compute singular values double prevNorm = -1.0; for (int j = 0; j "lt" n; ++j) { double[] column = MatGetColumn(A, j); double norm = VecNorm(column); // check if singular value is zero if (norm == 0.0 || prevNorm == 0.0 || (j "gt" 0 "and" norm "lte" tol * prevNorm)) { t[j] = 0.0; for (int i = 0; i "lt" m; ++i) A[i][j] = 0.0; prevNorm = 0.0; } else { t[j] = norm; for (int i = 0; i "lt" m; ++i) A[i][j] = A[i][j] * 1.0 / norm; prevNorm = norm; } } if (ct "gt" 0) { Console.WriteLine("Jacobi iterations did not" + " converge"); } U = A; Vh = MatTranspose(Q); s = t; // to sync with default np.linalg.svd() shapes: // if m "lt" n, extract 1st m columns of U // extract 1st m values of s // extract 1st m rows of Vh // not applicable for matrix inverse. // if (m "lt" n) // { // U = MatExtractFirstColumns(U, m); // s = VecExtractFirst(s, m); // Vh = MatExtractFirstRows(Vh, m); // } } // MatDecomposeSVD // ====================================================== // --- helper functions --------------------------------- // // MatMake, MatCopy, MatIdentity, MatLoad, // MatExtractFirstColumns, MatExtractFirstRows, // MatStandardize, MatGetColumn, MatTranspose, // MatProduct, MatNormColumns, MatExtractCols, VecNorm, // VecDot, VecExtractFirst, Hypot, ArgSort, // MatShow, VecShow // // ------------------------------------------------------ static double[][] MatMake(int r, int c) { double[][] result = new double[r][]; for (int i = 0; i "lt" r; ++i) result[i] = new double[c]; return result; } static double[][] MatCopy(double[][] m) { int r = m.Length; int c = m[0].Length; double[][] result = MatMake(r, c); for (int i = 0; i "lt" r; ++i) for (int j = 0; j "lt" c; ++j) result[i][j] = m[i][j]; return result; } static double[][] MatIdentity(int n) { double[][] result = MatMake(n, n); for (int i = 0; i "lt" n; ++i) result[i][i] = 1.0; return result; } static double[][] MatLoad(string fn, int[] usecols, char sep, string comment) { // count number of non-comment lines, self-contained //int nRows = NumNonCommentLines(fn, comment); int nRows = 0; string line = ""; FileStream ifs = new FileStream(fn, FileMode.Open); StreamReader sr = new StreamReader(ifs); while ((line = sr.ReadLine()) != null) if (line.StartsWith(comment) == false) ++nRows; sr.Close(); ifs.Close(); int nCols = usecols.Length; // double[][] result = MatCreate(nRows, nCols); double[][] result = new double[nRows][]; for (int r = 0; r "lt" nRows; ++r) result[r] = new double[nCols]; line = ""; string[] tokens = null; ifs = new FileStream(fn, FileMode.Open); sr = new StreamReader(ifs); int i = 0; while ((line = sr.ReadLine()) != null) { if (line.StartsWith(comment) == true) continue; tokens = line.Split(sep); for (int j = 0; j "lt" nCols; ++j) { int k = usecols[j]; // into tokens result[i][j] = double.Parse(tokens[k]); } ++i; } sr.Close(); ifs.Close(); return result; } static double[][] MatExtract(double[][] src, int r, int c) { // extract upper left corner double[][] result = MatMake(r, c); for (int i = 0; i "lt" r; ++i) for (int j = 0; j "lt" c; ++j) result[i][j] = src[i][j]; return result; } static double[][] MatExtractFirstColumns(double[][] src, int n) { int nRows = src.Length; // int nCols = src[0].Length; double[][] result = MatMake(nRows, n); for (int i = 0; i "lt" nRows; ++i) for (int j = 0; j "lt" n; ++j) result[i][j] = src[i][j]; return result; } static double[][] MatExtractFirstRows(double[][] src, int n) { // int nRows = src.Length; int nCols = src[0].Length; double[][] result = MatMake(n, nCols); for (int i = 0; i "lt" n; ++i) for (int j = 0; j "lt" nCols; ++j) result[i][j] = src[i][j]; return result; } static double[][] MatStandardize(double[][] data, out double[] means, out double[] stds) { // scikit style z-score biased normalization int rows = data.Length; int cols = data[0].Length; double[][] result = MatMake(rows, cols); // compute means double[] mns = new double[cols]; for (int j = 0; j "lt" cols; ++j) { double sum = 0.0; for (int i = 0; i "lt" rows; ++i) sum += data[i][j]; mns[j] = sum / rows; } // j // compute std devs double[] sds = new double[cols]; for (int j = 0; j "lt" cols; ++j) { double sum = 0.0; for (int i = 0; i "lt" rows; ++i) sum += (data[i][j] - mns[j]) * (data[i][j] - mns[j]); sds[j] = Math.Sqrt(sum / rows); // biased } // j // normalize for (int j = 0; j "lt" cols; ++j) { for (int i = 0; i "lt" rows; ++i) result[i][j] = (data[i][j] - mns[j]) / sds[j]; } // j means = mns; stds = sds; return result; } static double[] MatGetColumn(double[][] m, int j) { int rows = m.Length; double[] result = new double[rows]; for (int i = 0; i "lt" rows; ++i) result[i] = m[i][j]; return result; } static double[][] MatTranspose(double[][] m) { int r = m.Length; int c = m[0].Length; double[][] result = MatMake(c, r); for (int i = 0; i "lt" r; ++i) for (int j = 0; j "lt" c; ++j) result[j][i] = m[i][j]; return result; } static double[][] MatProduct(double[][] matA, double[][] matB) { int aRows = matA.Length; int aCols = matA[0].Length; int bRows = matB.Length; int bCols = matB[0].Length; if (aCols != bRows) throw new Exception("Non-conformable matrices"); double[][] result = MatMake(aRows, bCols); for (int i = 0; i "lt" aRows; ++i) // each row of A for (int j = 0; j "lt" bCols; ++j) // each col of B for (int k = 0; k "lt" aCols; ++k) result[i][j] += matA[i][k] * matB[k][j]; return result; } static double[][] MatExtractCols(double[][] mat, int[] cols) { int srcRows = mat.Length; int srcCols = mat[0].Length; int tgtCols = cols.Length; double[][] result = MatMake(srcRows, tgtCols); for (int i = 0; i "lt" srcRows; ++i) { for (int j = 0; j "lt" tgtCols; ++j) { int c = cols[j]; result[i][j] = mat[i][c]; } } return result; } static double VecNorm(double[] vec) { double sum = 0.0; int n = vec.Length; for (int i = 0; i "lt" n; ++i) sum += vec[i] * vec[i]; return Math.Sqrt(sum); } static double VecDot(double[] v1, double[] v2) { int n = v1.Length; double sum = 0.0; for (int i = 0; i "lt" n; ++i) sum += v1[i] * v2[i]; return sum; } static double[] VecExtractFirst(double[] vec, int n) { double[] result = new double[n]; for (int i = 0; i "lt" n; ++i) result[i] = vec[i]; return result; } static double Hypot(double x, double y) { // fancy sqrt(x^2 + y^2) double xabs = Math.Abs(x); double yabs = Math.Abs(y); double min, max; if (xabs "lt" yabs) { min = xabs; max = yabs; } else { min = yabs; max = xabs; } if (min == 0) return max; double u = min / max; return max * Math.Sqrt(1 + u * u); } static int[] ArgSort(double[] vec) { int n = vec.Length; int[] idxs = new int[n]; for (int i = 0; i "lt" n; ++i) idxs[i] = i; Array.Sort(vec, idxs); // sort idxs on vec vals return idxs; } // ------------------------------------------------------ static void MatShow(double[][] M, int dec, int wid, bool showIndices) { double small = 1.0 / Math.Pow(10, dec); for (int i = 0; i "lt" M.Length; ++i) { if (showIndices == true) { int pad = M.Length.ToString().Length; Console.Write("[" + i.ToString(). PadLeft(pad) + "]"); } for (int j = 0; j "lt" M[0].Length; ++j) { double v = M[i][j]; if (Math.Abs(v) "lt" small) v = 0.0; Console.Write(v.ToString("F" + dec). PadLeft(wid)); } Console.WriteLine(""); } } // ------------------------------------------------------ static void MatShow(double[][] M, int dec, int wid, int nRows) { double small = 1.0 / Math.Pow(10, dec); for (int i = 0; i "lt" nRows; ++i) { for (int j = 0; j "lt" M[0].Length; ++j) { double v = M[i][j]; if (Math.Abs(v) "lt" small) v = 0.0; Console.Write(v.ToString("F" + dec). PadLeft(wid)); } Console.WriteLine(""); } if (nRows "lt" M.Length) Console.WriteLine(". . . "); } // ------------------------------------------------------ static void VecShow(double[] vec, int dec, int wid) { for (int i = 0; i "lt" vec.Length; ++i) { double x = vec[i]; if (Math.Abs(x) "lt" 1.0e-8) x = 0.0; Console.Write(x.ToString("F" + dec). PadLeft(wid)); } Console.WriteLine(""); } static void VecShow(int[] vec, int wid) { for (int i = 0; i "lt" vec.Length; ++i) { int x = vec[i]; Console.Write(x.ToString(). PadLeft(wid)); } Console.WriteLine(""); } // ------------------------------------------------------ static double[] VarExplained(double[] eigenVals) { // assumes eigenVals are sorted large to small int n = eigenVals.Length; double[] result = new double[n]; double sum = 0.0; for (int i = 0; i "lt" n; ++i) sum += eigenVals[i]; for (int i = 0; i "lt" n; ++i) { double pctExplained = eigenVals[i] / sum; result[i] = pctExplained; } return result; } } // Program } // ns
Here’s the scikit version:
# iris_pca.py # demo of PCA on the Iris subset import numpy as np from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler # def mean(x): # np.mean(X, axis = 0) # return sum(x)/len(x) # def std(x): # np.std(X, axis = 0) # return (sum((i - mean(x))**2 for i in x)/len(x))**0.5 # def Standardize_data(X): # return (X - mean(X))/std(X) print("\nBegin PCA using scikit ") np.set_printoptions(precision=4, suppress=True, floatmode='fixed') print("\nLoading subset of Penguin data ") data_file = ".\\Data\\penguin_9.txt" X = np.loadtxt(data_file, delimiter=",", usecols=[1,2,3,4], comments="#", dtype=np.float64) print("Done ") print("\nSource data: ") print(X) print("\nStandardizing (z-score, biased) ") scaler = StandardScaler() scaler.fit(X) X_std = scaler.transform(X) # print(scaler.mean_) # print(np.sqrt(scaler.var_)) print("\nStandardized data: ") print(X_std[0:3,:]) print(". . . ") pca = PCA(n_components=4) pca.fit(X_std) # covar_mat = pca.get_covariance() # print("\ncovariance matrix: ") # print(covar_mat) eVals = pca.explained_variance_ print("\nEigenvalues (sorted): ") print(eVals) print("\nEigenvectors (sorted, rows): ") eVecs = pca.components_ print(eVecs) pctVar = pca.explained_variance_ratio_ print("\npercent variance explained: ") print(pctVar) transformed = pca.transform(X_std) print("\nTransformed data (all components): ") print(transformed) # compute transformed from eigenvectors # trans_check = np.matmul(X_std, eVecs.T) # print("\nFirst 3 transformed verify: ") # print(trans_check[0:3,0:2]) print("\nReconstructing " + \ " original data: "); reconstructed = np.matmul(transformed, eVecs) reconstructed *= np.sqrt(scaler.var_) reconstructed += scaler.mean_ print(reconstructed[0:3,:]) print(". . . ") print("\nEnd demo ")
You must be logged in to post a comment.