Principal Component Analysis (PCA) Using the SVD Technique From Scratch Using C#

Principal component analysis (PCA) is a classical statistics technique for dimensionality reduction. For example, the Penguin Dataset has 333 valid (no missing values) items. Each item has four numeric predictor variables (bill length, bill width, flipper length, body weight) and a species to predict (0 = Adelie, 1 = Chinstrap, 2 = Gentoo). In some scenarios, instead of dealing with all four predictors, you might want to reduce the data to just two predictors. Of course, this is just an example. A more realistic example would be a very large dataset with many predictor variables.

The classical technique for PCA looks like:

1. load data
2. normalize data
3. compute covariance matrix from normalized data
4. compute eigenvalues and eigenvectors directly
   from covariance matrix
5. sort eigenvalues from large to small
6. determine how many eigenvalues to use
7. extract associated eigenvectors
8. use eigenvectors to reduce data

In the classical technique, in step #4, eigenvalues and eigenvectors are computed directly from the covariance matrix, typically by using the QR decomposition algorithm, or by using the Jacobi algorithm.

I noticed that the scikit library PCA module documentation states that singular value decomposition (SVD) is used instead of computing eigenvalues and eigenvectors directly.

If you have a matrix A and apply SVD, you get a matrix U, a vector s of “singular values”, and a matrix Vh so that A = U * S * Vh where S is a matrix with the s vector values on the diagonal. To use SVD for PCA, you apply SVD directly to the normalized source data. The eigenvectors are stored in U. To compute the eigenvectors from the singular values, the computation is eigval(i) = s(i)^2 / (n-1) where n is the number of rows in the source data.

Put another way, step #4 in PCA becomes:

4. compute eigenvalues and eigenvectors indirectly
   from SVD applied to normalized data (using s and U results)

I had examples of classical PCA from scratch using C# (with the QR algorithm, and with the Jacobi algorithm, for computing eigenvalues and eigenvectors) but I decided to put together an example of PCA from scratch using C# with the SVD approach. I wanted to verify my knowledge.

The classical code for step #4 look like:

double[] eigenVals;
double[][] eigenVecs;
Eigen(covarMat, out eigenVals, out eigenVecs); // QR or Jacobi

The SVD version for step #4 looks like:

double[][] U;
double[][] Vh;
double[] s;

int n = stdX.Length; int dim = 4;
SVD_Jacobi(MatTranspose(stdX), out U, out Vh, out s); 

// compute eigenvalues from singular values
double[] eigenVals = new double[dim];
for (int i = 0; i "lt"; dim; ++i)
  eigenVals[i] = (s[i] * s[i]) / (n - 1);

// eigenvectors are in U
double[][] eigenVecs = MatExtract(U, dim, dim);

I ran my from-scratch C# PCA using SVD version and a scikit PCA demo on a 9-item subset of the Penguin Dataset. Both programs gave identical results (except for the signs of eigenvectors, which is OK). The source data is:

# penguin_9.txt
# species: 0 = Adelie, 1 = Chinstrap, 2 = Gentoo
# bill length (mm), bill depth (mm),
# flipper length (mm), body mass (g)
#
0, 39.1, 18.7, 181.0, 3750.0
0, 39.5, 17.4, 186.0, 3800.0
0, 40.3, 18.0, 195.0, 3250.0
1, 46.5, 17.9, 192.0, 3500.0
1, 50.0, 19.5, 196.0, 3900.0
1, 51.3, 19.2, 193.0, 3650.0
2, 46.1, 13.2, 211.0, 4500.0
2, 50.0, 16.3, 230.0, 5700.0
2, 48.7, 14.1, 210.0, 4450.0

The SVD version of PCA is often preferred to the classical direct eigenvalues and eigenvectors version. A few Internet resources say something along the lines of, “SVD is more efficient than directly computing eigenvectors” or “Computing SVD singular values is more stable than computing eigenvalues directly”, but I haven’t seen any hard research evidence that demonstrates this. (There probably is such evidence but I just haven’t seen it).

Good fun.

Three images from an Internet search for “minimalist portrait”, ordered by increasing dimensionality reduction. The middle portrait is by famous artist Ichiro Tsuruta. The other two are anonymous.

Demo code for the PCA using SVD. Replace “lt”, “gt”, “lte”, “gte”, “and” with Boolean operator symbols.

using System;
using System.IO;

namespace PrincipalComponentSVD
{
  internal class PrincipalComponentProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin PCA from scratch C#" +
        " using SVD technique ");

      // 0. load data
      Console.WriteLine("\nLoading Penguin-9 data ");
      string dataFile =
        "..\\..\\..\\Data\\penguin_9.txt";
      double[][] X = MatLoad(dataFile,
        new int[] { 1, 2, 3, 4 }, ',', "#");
      Console.WriteLine("Done ");

      Console.WriteLine("\nSource data: ");
      MatShow(X, 1, 9, true);

      // 1. standardize data
      Console.WriteLine("\nStandardizing (z-score," +
        " biased) ");
      double[] means;
      double[] stds;
      double[][] stdX = MatStandardize(X,
        out means, out stds);

      Console.WriteLine("\nStandardized data items ");
      MatShow(stdX, 4, 9, nRows: 3);
  
      Console.WriteLine("\nComputing " +
        " eigenvalues and eigenvectors using SVD: ");

      // 2. eigenvalues eigenvectors from standardized data
      // SVD approach
      // eigenvectors are in U, singular vals are in s
      double[][] U;
      double[][] Vh;
      double[] s;

      int n = stdX.Length;
      int dim = stdX[0].Length;
      SVD_Jacobi(MatTranspose(stdX), out U, out Vh, out s);

      // compute eigenvalues from singular values
      double[] eigenVals = new double[dim];
      for (int i = 0; i "lt" dim; ++i)
        eigenVals[i] = (s[i] * s[i]) / (n - 1);

      // eigenvectors are in U
      double[][] eigenVecs = MatExtract(U, dim, dim);
      // double[][] eigenVecs = U;  // works too
      // double[][] eigenVecs = MatCopy(U);  // works too
      Console.WriteLine("\nEigenvectors (not sorted, " +
        "as cols): ");
      MatShow(eigenVecs, 4, 9, false);

      Console.WriteLine("\nEigenvalues, not sorted: ");
      VecShow(eigenVals, 4, 9);

      // 3. sort eigenvals from large to smallest
      int[] idxs = ArgSort(eigenVals);  // to sort evecs
      Array.Reverse(idxs);
      Array.Sort(eigenVals);
      Array.Reverse(eigenVals);
      Console.WriteLine("\nEigenvalues (sorted): ");
      VecShow(eigenVals, 4, 9);

      // 4. sort eigenvectors
      eigenVecs = MatExtractCols(eigenVecs, idxs);  // sort 
      eigenVecs = MatTranspose(eigenVecs);  // scikit format
      Console.WriteLine("\nEigenvectors (sorted, rows): ");
      MatShow(eigenVecs, 4, 9, false);

      // 5. variance explained to guide number components
      Console.WriteLine("\nComputing variance" +
        " explained: ");
      double[] varExplained = VarExplained(eigenVals);
      VecShow(varExplained, 4, 9);

      // 6. transform all data
      Console.WriteLine("\nComputing transformed" +
        " data (all components): ");
      double[][] transformed =
        MatProduct(stdX, MatTranspose(eigenVecs));  // all 
      MatShow(transformed, 4, 9, true);

      // 7. reconstruct data just for fun / to verify
      Console.WriteLine("\nReconstructing " +
       " original data: ");
      double[][] reconstructed =
        MatProduct(transformed, eigenVecs);
      for (int i = 0; i "lt" reconstructed.Length; ++i)
        for (int j = 0; j "lt" reconstructed[0].Length; ++j)
          reconstructed[i][j] =
            (reconstructed[i][j] * stds[j]) + means[j];
      MatShow(reconstructed, 1, 9, 3);

      Console.WriteLine("\nEnd demo ");
      Console.ReadLine();
    }// Main

    // ======================================================

    static void SVD_Jacobi(double[][] mat, out double[][] U,
         out double[][] Vh, out double[] s)
    {
      // works for m != n
      double EPSILON = 1.0e-15;

      double[][] A = MatCopy(mat); // working U
      int m = A.Length;
      int n = A[0].Length;  // check m == n
      double[][] Q = MatIdentity(n); // working V
      double[] t = new double[n];  // working s

      int ct = 1;  // rotation counter
      int pass = 0;
      double tol = 10 * n * EPSILON; // heuristic

      int passMax = 5 * n;
      if (passMax "lt" 15) passMax = 15; // heuristic

      // save the column error estimates
      for (int j = 0; j "lt" n; ++j)
      {
        double[] cj = MatGetColumn(A, j);
        double sj = VecNorm(cj);
        t[j] = EPSILON * sj;
      }

      while (ct "gt" 0 "and" pass "lte" passMax)
      {
        ct = n * (n - 1) / 2;  // rotation counter
        for (int j = 0; j "lt" n - 1; ++j)
        {
          for (int k = j + 1; k "lt" n; ++k)
          {
            double sin; double cos;

            double[] cj = MatGetColumn(A, j);
            double[] ck = MatGetColumn(A, k);

            double p = 2.0 * VecDot(cj, ck);
            double a = VecNorm(cj);
            double b = VecNorm(ck);

            double q = a * a - b * b;
            double v = Hypot(p, q);

            double errorA = t[j];
            double errorB = t[k];

            bool sorted = false;
            if (a "gte" b) sorted = true;

            bool orthog = false;
            if (Math.Abs(p) "lte" tol * (a * b)) orthog = true;

            bool badA = false;
            if (a "lt" errorA) badA = true;
            bool badB = false;
            if (b "lt" errorB) badB = true;

            if (sorted == true "and" (orthog == true ||
              badA == true || badB == true))
            {
              --ct;
              continue;
            }

            // compute rotation angles
            if (v == 0 || sorted == false)
            {
              cos = 0.0;
              sin = 1.0;
            }
            else
            {
              cos = Math.Sqrt((v + q) / (2.0 * v));
              sin = p / (2.0 * v * cos);
            }

            // apply rotation to A (U)
            for (int i = 0; i "lt" m; ++i)
            {
              double Aik = A[i][k];
              double Aij = A[i][j];
              A[i][j] = Aij * cos + Aik * sin;
              A[i][k] = -Aij * sin + Aik * cos;
            }

            // update singular values
            t[j] = Math.Abs(cos) * errorA + 
              Math.Abs(sin) * errorB;
            t[k] = Math.Abs(sin) * errorA + 
              Math.Abs(cos) * errorB;

            // apply rotation to Q (V)
            for (int i = 0; i "lt" n; ++i)
            {
              double Qij = Q[i][j];
              double Qik = Q[i][k];
              Q[i][j] = Qij * cos + Qik * sin;
              Q[i][k] = -Qij * sin + Qik * cos;
            } // i
          } // k
        } // j

        ++pass;
      } // while

      //  compute singular values
      double prevNorm = -1.0;
      for (int j = 0; j "lt" n; ++j)
      {
        double[] column = MatGetColumn(A, j);
        double norm = VecNorm(column);

        // check if singular value is zero
        if (norm == 0.0 || prevNorm == 0.0
          || (j "gt" 0 "and" norm "lte" tol * prevNorm))
        {
          t[j] = 0.0;
          for (int i = 0; i "lt" m; ++i)
            A[i][j] = 0.0;
          prevNorm = 0.0;
        }
        else
        {
          t[j] = norm;
          for (int i = 0; i "lt" m; ++i)
            A[i][j] = A[i][j] * 1.0 / norm;
          prevNorm = norm;
        }
      }

      if (ct "gt" 0)
      {
        Console.WriteLine("Jacobi iterations did not" +
          " converge");
      }

      U = A;
      Vh = MatTranspose(Q);
      s = t;

      // to sync with default np.linalg.svd() shapes:
      // if m "lt" n, extract 1st m columns of U
      //   extract 1st m values of s
      //   extract 1st m rows of Vh
      // not applicable for matrix inverse.

      // if (m "lt" n)
      // {
      //   U = MatExtractFirstColumns(U, m);
      //   s = VecExtractFirst(s, m);
      //   Vh = MatExtractFirstRows(Vh, m);
      // }

    } // MatDecomposeSVD

    // ======================================================

    // --- helper functions ---------------------------------
    //
    // MatMake, MatCopy, MatIdentity, MatLoad, 
    // MatExtractFirstColumns, MatExtractFirstRows, 
    // MatStandardize, MatGetColumn, MatTranspose,
    // MatProduct, MatNormColumns, MatExtractCols, VecNorm,
    // VecDot, VecExtractFirst, Hypot, ArgSort,
    // MatShow, VecShow
    //
    // ------------------------------------------------------

    static double[][] MatMake(int r, int c)
    {
      double[][] result = new double[r][];
      for (int i = 0; i "lt" r; ++i)
        result[i] = new double[c];
      return result;
    }

    static double[][] MatCopy(double[][] m)
    {
      int r = m.Length; int c = m[0].Length;
      double[][] result = MatMake(r, c);
      for (int i = 0; i "lt" r; ++i)
        for (int j = 0; j "lt" c; ++j)
          result[i][j] = m[i][j];
      return result;
    }

    static double[][] MatIdentity(int n)
    {
      double[][] result = MatMake(n, n);
      for (int i = 0; i "lt" n; ++i)
        result[i][i] = 1.0;
      return result;
    }

    static double[][] MatLoad(string fn, int[] usecols,
      char sep, string comment)
    {
      // count number of non-comment lines, self-contained
      //int nRows = NumNonCommentLines(fn, comment);
      int nRows = 0;
      string line = "";
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      while ((line = sr.ReadLine()) != null)
        if (line.StartsWith(comment) == false)
          ++nRows;
      sr.Close(); ifs.Close();

      int nCols = usecols.Length;
      // double[][] result = MatCreate(nRows, nCols);
      double[][] result = new double[nRows][];
      for (int r = 0; r "lt" nRows; ++r)
        result[r] = new double[nCols];

      line = "";
      string[] tokens = null;
      ifs = new FileStream(fn, FileMode.Open);
      sr = new StreamReader(ifs);

      int i = 0;
      while ((line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment) == true)
          continue;
        tokens = line.Split(sep);
        for (int j = 0; j "lt" nCols; ++j)
        {
          int k = usecols[j];  // into tokens
          result[i][j] = double.Parse(tokens[k]);
        }
        ++i;
      }
      sr.Close(); ifs.Close();
      return result;
    }

    static double[][] MatExtract(double[][] src,
      int r, int c)
    {
      // extract upper left corner
      double[][] result = MatMake(r, c);
      for (int i = 0; i "lt" r; ++i)
        for (int j = 0; j "lt" c; ++j)
          result[i][j] = src[i][j];
      return result;
    }

    static double[][] 
      MatExtractFirstColumns(double[][] src, int n)
    {
      int nRows = src.Length;
      // int nCols = src[0].Length;
      double[][] result = MatMake(nRows, n);
      for (int i = 0; i "lt" nRows; ++i)
        for (int j = 0; j "lt" n; ++j)
          result[i][j] = src[i][j];
      return result;
    }

    static double[][] 
      MatExtractFirstRows(double[][] src, int n)
    {
      // int nRows = src.Length;
      int nCols = src[0].Length;
      double[][] result = MatMake(n, nCols);
      for (int i = 0; i "lt" n; ++i)
        for (int j = 0; j "lt" nCols; ++j)
          result[i][j] = src[i][j];
      return result;
    }

    static double[][] MatStandardize(double[][] data,
      out double[] means, out double[] stds)
    {
      // scikit style z-score biased normalization
      int rows = data.Length;
      int cols = data[0].Length;
      double[][] result = MatMake(rows, cols);

      // compute means
      double[] mns = new double[cols];
      for (int j = 0; j "lt" cols; ++j)
      {
        double sum = 0.0;
        for (int i = 0; i "lt" rows; ++i)
          sum += data[i][j];
        mns[j] = sum / rows;
      } // j

      // compute std devs
      double[] sds = new double[cols];
      for (int j = 0; j "lt" cols; ++j)
      {
        double sum = 0.0;
        for (int i = 0; i "lt" rows; ++i)
          sum += (data[i][j] - mns[j]) *
            (data[i][j] - mns[j]);
        sds[j] = Math.Sqrt(sum / rows);  // biased
      } // j

      // normalize
      for (int j = 0; j "lt" cols; ++j)
      {
        for (int i = 0; i "lt" rows; ++i)
          result[i][j] = 
            (data[i][j] - mns[j]) / sds[j];
      } // j

      means = mns;
      stds = sds;

      return result;
    }

    static double[] MatGetColumn(double[][] m, int j)
    {
      int rows = m.Length;
      double[] result = new double[rows];
      for (int i = 0; i "lt" rows; ++i)
        result[i] = m[i][j];
      return result;
    }

    static double[][] MatTranspose(double[][] m)
    {
      int r = m.Length;
      int c = m[0].Length;
      double[][] result = MatMake(c, r);
      for (int i = 0; i "lt" r; ++i)
        for (int j = 0; j "lt" c; ++j)
          result[j][i] = m[i][j];
      return result;
    }

    static double[][] MatProduct(double[][] matA,
      double[][] matB)
    {
      int aRows = matA.Length;
      int aCols = matA[0].Length;
      int bRows = matB.Length;
      int bCols = matB[0].Length;
      if (aCols != bRows)
        throw new Exception("Non-conformable matrices");

      double[][] result = MatMake(aRows, bCols);

      for (int i = 0; i "lt" aRows; ++i) // each row of A
        for (int j = 0; j "lt" bCols; ++j) // each col of B
          for (int k = 0; k "lt" aCols; ++k)
            result[i][j] += matA[i][k] * matB[k][j];

      return result;
    }

    static double[][] MatExtractCols(double[][] mat,
      int[] cols)
    {
      int srcRows = mat.Length;
      int srcCols = mat[0].Length;
      int tgtCols = cols.Length;

      double[][] result = MatMake(srcRows, tgtCols);
      for (int i = 0; i "lt" srcRows; ++i)
      {
        for (int j = 0; j "lt" tgtCols; ++j)
        {
          int c = cols[j];
          result[i][j] = mat[i][c];
        }
      }
      return result;
    }

    static double VecNorm(double[] vec)
    {
      double sum = 0.0;
      int n = vec.Length;
      for (int i = 0; i "lt" n; ++i)
        sum += vec[i] * vec[i];
      return Math.Sqrt(sum);
    }

    static double VecDot(double[] v1, double[] v2)
    {
      int n = v1.Length;
      double sum = 0.0;
      for (int i = 0; i "lt" n; ++i)
        sum += v1[i] * v2[i];
      return sum;
    }

    static double[] VecExtractFirst(double[] vec, int n)
    {
      double[] result = new double[n];
      for (int i = 0; i "lt" n; ++i)
        result[i] = vec[i];
      return result;
    }

    static double Hypot(double x, double y)
    {
      // fancy sqrt(x^2 + y^2)
      double xabs = Math.Abs(x);
      double yabs = Math.Abs(y);
      double min, max;

      if (xabs "lt" yabs)
      {
        min = xabs; max = yabs;
      }
      else
      {
        min = yabs; max = xabs;
      }

      if (min == 0)
        return max;

      double u = min / max;
      return max * Math.Sqrt(1 + u * u);
    }

    static int[] ArgSort(double[] vec)
    {
      int n = vec.Length;
      int[] idxs = new int[n];
      for (int i = 0; i "lt" n; ++i)
        idxs[i] = i;
      Array.Sort(vec, idxs);  // sort idxs on vec vals
      return idxs;
    }

    // ------------------------------------------------------

    static void MatShow(double[][] M, int dec,
      int wid, bool showIndices)
    {
      double small = 1.0 / Math.Pow(10, dec);
      for (int i = 0; i "lt" M.Length; ++i)
      {
        if (showIndices == true)
        {
          int pad = M.Length.ToString().Length;
          Console.Write("[" + i.ToString().
            PadLeft(pad) + "]");
        }
        for (int j = 0; j "lt" M[0].Length; ++j)
        {
          double v = M[i][j];
          if (Math.Abs(v) "lt" small) v = 0.0;
          Console.Write(v.ToString("F" + dec).
            PadLeft(wid));
        }
        Console.WriteLine("");
      }
    }

    // ------------------------------------------------------

    static void MatShow(double[][] M, int dec,
      int wid, int nRows)
    {
      double small = 1.0 / Math.Pow(10, dec);
      for (int i = 0; i "lt" nRows; ++i)
      {
        for (int j = 0; j "lt" M[0].Length; ++j)
        {
          double v = M[i][j];
          if (Math.Abs(v) "lt" small) v = 0.0;
          Console.Write(v.ToString("F" + dec).
            PadLeft(wid));
        }
        Console.WriteLine("");
      }
      if (nRows "lt" M.Length)
        Console.WriteLine(". . . ");
    }

    // ------------------------------------------------------

    static void VecShow(double[] vec,
      int dec, int wid)
    {
      for (int i = 0; i "lt" vec.Length; ++i)
      {
        double x = vec[i];
        if (Math.Abs(x) "lt" 1.0e-8) x = 0.0;
        Console.Write(x.ToString("F" + dec).
          PadLeft(wid));
      }
      Console.WriteLine("");
    }

    static void VecShow(int[] vec, int wid)
    {
      for (int i = 0; i "lt" vec.Length; ++i)
      {
        int x = vec[i];
        Console.Write(x.ToString().
          PadLeft(wid));
      }
      Console.WriteLine("");
    }

    // ------------------------------------------------------

    static double[] VarExplained(double[] eigenVals)
    {
      // assumes eigenVals are sorted large to small
      int n = eigenVals.Length;
      double[] result = new double[n];
      double sum = 0.0;
      for (int i = 0; i "lt" n; ++i)
        sum += eigenVals[i];
      for (int i = 0; i "lt" n; ++i)
      {
        double pctExplained = eigenVals[i] / sum;
        result[i] = pctExplained;
      }
      return result;
    }

  } // Program

} // ns

Here’s the scikit version:

# iris_pca.py
# demo of PCA on the Iris subset

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# def mean(x): # np.mean(X, axis = 0)  
#   return sum(x)/len(x)  

# def std(x): # np.std(X, axis = 0)
#   return (sum((i - mean(x))**2 for i in x)/len(x))**0.5

# def Standardize_data(X):
#   return (X - mean(X))/std(X)

print("\nBegin PCA using scikit ")
np.set_printoptions(precision=4, suppress=True,
  floatmode='fixed')

print("\nLoading subset of Penguin data ")
data_file = ".\\Data\\penguin_9.txt"
X = np.loadtxt(data_file, delimiter=",", 
  usecols=[1,2,3,4], comments="#", 
  dtype=np.float64)
print("Done ")

print("\nSource data: ")
print(X)

print("\nStandardizing (z-score, biased) ")
scaler = StandardScaler()
scaler.fit(X)
X_std = scaler.transform(X)

# print(scaler.mean_)
# print(np.sqrt(scaler.var_))

print("\nStandardized data: ")
print(X_std[0:3,:])
print(". . . ")

pca = PCA(n_components=4)
pca.fit(X_std)

# covar_mat = pca.get_covariance()
# print("\ncovariance matrix: ")
# print(covar_mat)

eVals = pca.explained_variance_
print("\nEigenvalues (sorted): ")
print(eVals)

print("\nEigenvectors (sorted, rows): ")
eVecs = pca.components_
print(eVecs)

pctVar = pca.explained_variance_ratio_
print("\npercent variance explained: ")
print(pctVar)

transformed = pca.transform(X_std)
print("\nTransformed data (all components): ")
print(transformed)

# compute transformed from eigenvectors
# trans_check = np.matmul(X_std, eVecs.T)
# print("\nFirst 3 transformed verify: ")
# print(trans_check[0:3,0:2])

print("\nReconstructing " + \
 " original data: ");
reconstructed = np.matmul(transformed, eVecs)
reconstructed *= np.sqrt(scaler.var_)
reconstructed += scaler.mean_
print(reconstructed[0:3,:])
print(". . . ")

print("\nEnd demo ")