Gaussian mixture model (GMM) clustering is an alternative to k-means clustering. Spoiler alert: GMM is an order of magnitude more complex than k-means.
A few days ago I implemented GMM clustering from scratch using Python/NumPy/SciPy. It was a big effort that took several days. On a recent weekend I decided to refactor my from-scratch Python code to C#. This was another big effort because the Python from-scratch code used the scipy.stats module multivariate_normal.pdf() library function to compute a required probability density function.
C# has no such from-scratch library to compute a multivariate Gaussian pdf() function so I had to implement one myself. It’s a difficult problem that requires a from-scratch Cholesky matrix decomposition function, along with several other non-trivial functions.
The shell on the left shows my from-scratch C# implementation of GMM clustering. The shell on the right shows the scikit GaussianMixture library module. The results are (nearly) identical but I had to fiddle with the random initialization seeds.
Anyway, after many hours of work, I got a from-scratch C# GMM clustering program up and running. To validate my C# version, I ran it side by side with the sklearn.mixture library module GaussianMixture() function on a tiny 10-item dummy dataset to make sure I got the same results. I noticed that GMM clustering is extremely sensitive to its random initialization and so I had to try different random seed values until I got identical results.
The GMM clustering algorithm is iterative. The stopping condition is either 1.) reaching a specified max iterations, or 2.) when there is no change (within 1.0e-3) as measured by log-likelihood. Even this detail is a complex topic. The scikit GMM version required 24 iterations until convergence. My from-scratch C# version required 76 iterations. I speculate the difference is due to how how the covariance matrices are initialized.
The few examples of GMM clustering from scratch I found on the Internet were grotesquely incorrect. So, I started from ground zero by using a somewhat obscure transcription of a Stanford University lecture at web.stanford.edu/~lmackey/stats306b/doc/stats306b-spring14-lecture2_scribed.pdf. Here are the key equations. They’re deceptively simple-looking and are actually very difficult to implement.
I designed my GMM class to be completely self-contained with absolutely no external dependencies. Calling the code looks like:
. . . // load data to cluster from file into X[][] string dataFile = "..\\..\\..\\Data\\dummy_10.txt"; double[][] X = MatLoad(dataFile, new int[] { 0, 1 }, ',', "#"); // helper function // 3 clusters, 100 max iterations GMM gmm = new GMM(3, 100, seed: 9); Console.WriteLine("Clustering "); int numIter = gmm.Cluster(X); Console.WriteLine("Done "); double[][] probs = gmm.PredictProbs(X); int[] labels = gmm.PredictLabels(X);
The PredictProbs() function returns the clustering as probabilities of cluster membership. For example if a data item has probabilities [0.1500, 0.7500, 0.1000] then the item is assigned to cluster 1. The PredictLabels() function returns the cluster assignments without the underlying probabilities.
Here’s an example where I added a third column to the 10-item source data. It took me a long time to find the combination of initialization seed values to get the from-scratch C# implementation to match the scikit library version.
While I was experimenting with GMM clustering, I noticed that the technique is extremely sensitive to the random initialization seed. So, the three main weaknesses of GMM clustering are: 1.) extreme complexity, 2.) very high sensitivity, 3.) inability to deal with non-numeric data.
The bottom line is that implementing GMM clustering from scratch using C# was a large effort — one of the most complex systems I’ve ever written. But it was a lot of fun and quite satisfying.
Author Erle Stanley Gardner (1889-1970) wrote 86 Perry Mason mystery novels from 1933 to 1973. The books were also adapted to a radio show and a long-running TV show (1957-66). Here are three PM book covers that I’d cluster as having visually similar styles — sort of a 1960s feel.
Left: In “The Case of the Terrified Typist” (1956), a woman seeks Mason’s help to escape criminals who steal gems. Cover art by Harry Sheldon.
Center: In “The Case of the Caretaker’s Cat” (1935), Mason solves a murder related to an old man’s will and greedy heirs. Cover art by Robert McGinnis.
Right: In “The Case of the Crooked Candle” (1944), a man is found murdered on his yacht, along with a leaning candle which holds the key to the mystery. Cover art by Mitchell Hooks.
Demo code. Very long and very complex. Not thoroughly tested. Replace “lt”, “gt”, “lte”, “gte”, “and” with Boolean operator symbols.
using System; using System.IO; namespace GaussianMixture { internal class GaussianMixtureProgram { static void Main(string[] args) { Console.WriteLine("\nBegin GMM from scratch C# "); Console.WriteLine("\nLoading data "); // hard-coded double[][] X = new double[10][]; // 10x2 X[0] = new double[] { 0.01, 0.10 }; X[1] = new double[] { 0.02, 0.09 }; X[2] = new double[] { 0.03, 0.10 }; X[3] = new double[] { 0.04, 0.06 }; X[4] = new double[] { 0.05, 0.06 }; X[5] = new double[] { 0.06, 0.05 }; X[6] = new double[] { 0.07, 0.05 }; X[7] = new double[] { 0.08, 0.01 }; X[8] = new double[] { 0.09, 0.02 }; X[9] = new double[] { 0.10, 0.01 }; // read from file // string dataFile = "..\\..\\..\\Data\\dummy_10.txt"; // double[][] X = MatLoad(dataFile, // new int[] { 0, 1 }, ',', "#"); // Console.WriteLine("Done "); Console.WriteLine("\nData: "); MatShow(X, 2, 6); int mi = 100; // max iterations Console.WriteLine("\nCreating C# scratch" + " GMM model k=3, maxIter=100 "); GMM gmm = new GMM(3, mi, seed: 9); Console.WriteLine("\nClustering "); int numIter = gmm.Cluster(X); Console.WriteLine("Done. Used " + numIter + " iterations "); double[][] probs = gmm.PredictProbs(X); int[] labels = gmm.PredictLabels(X); Console.WriteLine("\nClustering results: "); for (int i = 0; i "lt" probs.Length; ++i) { VecShow(X[i], 2, 6, false); Console.Write(" | "); VecShow(probs[i], 4, 9, false); Console.Write(" | "); Console.WriteLine(labels[i]); } Console.WriteLine("\nmeans: "); MatShow(gmm.means, 4, 9); Console.WriteLine("\ncovariances: "); for (int k = 0; k "lt" 3; ++k) { MatShow(gmm.covars[k], 4, 9); Console.WriteLine(""); } Console.WriteLine("\ncoefficients: "); VecShow(gmm.coefs, 4, 9, true); Console.WriteLine("\nEnd demo "); Console.ReadLine(); } // Main // ------------------------------------------------------ static double[][] MatCreate(int rows, int cols) { // helper for MatLoad() double[][] result = new double[rows][]; for (int i = 0; i "lt" rows; ++i) result[i] = new double[cols]; return result; } // ------------------------------------------------------ static int NumNonCommentLines(string fn, string comment) { // helper for MatLoad() int ct = 0; string line = ""; FileStream ifs = new FileStream(fn, FileMode.Open); StreamReader sr = new StreamReader(ifs); while ((line = sr.ReadLine()) != null) if (line.StartsWith(comment) == false) ++ct; sr.Close(); ifs.Close(); return ct; } // ------------------------------------------------------ static double[][] MatLoad(string fn, int[] usecols, char sep, string comment) { // count number of non-comment lines int nRows = NumNonCommentLines(fn, comment); int nCols = usecols.Length; double[][] result = MatCreate(nRows, nCols); string line = ""; string[] tokens = null; FileStream ifs = new FileStream(fn, FileMode.Open); StreamReader sr = new StreamReader(ifs); int i = 0; while ((line = sr.ReadLine()) != null) { if (line.StartsWith(comment) == true) continue; tokens = line.Split(sep); for (int j = 0; j "lt" nCols; ++j) { int k = usecols[j]; // into tokens result[i][j] = double.Parse(tokens[k]); } ++i; } sr.Close(); ifs.Close(); return result; } // ------------------------------------------------------ static void MatShow(double[][] m, int dec, int wid) { for (int i = 0; i "lt" m.Length; ++i) { for (int j = 0; j "lt" m[0].Length; ++j) { double v = m[i][j]; if (Math.Abs(v) "lt" 1.0e-5) v = 0.0; // hack Console.Write(v.ToString("F" + dec).PadLeft(wid)); } Console.WriteLine(""); } } // ------------------------------------------------------ static void VecShow(double[] vec, int dec, int wid, bool newLine) { for (int i = 0; i "lt" vec.Length; ++i) { double x = vec[i]; if (Math.Abs(x) "lt" 1.0e-5) x = 0.0; // avoid "-0.0" Console.Write(x.ToString("F" + dec).PadLeft(wid)); } if (newLine == true) Console.WriteLine(""); } } // Program // ======================================================== public class GMM { public int k; // number components public int maxIter; public Random rnd; // for initialization public int N; // number data items public int dim; // per data item public double[] coefs; public double[][] means; // init'ed by Cluster() public double[][][] covars; // init'ed by Cluster() public double[][] probs; // computed clustering probs public GMM(int k, int maxIter, int seed) { this.k = k; this.maxIter = maxIter; this.rnd = new Random(seed); this.N = 0; this.dim = 0; this.coefs = new double[k]; for (int j = 0; j "lt" k; ++j) this.coefs[j] = 1.0 / k; } // ctor // ------------------------------------------------------ public int Cluster(double[][] X) { this.N = X.Length; this.dim = X[0].Length; // init means to k random data items this.means = LocalMatCreate(this.k, this.dim); int[] idxs = Select(this.N, this.k); for (int j = 0; j "lt" this.k; ++j) { int idx = idxs[j]; for (int d = 0; d "lt" this.dim; ++d) this.means[j][d] = X[idx][d]; } // init covars to (dim by dim) Identity matrices this.covars = LocalMat3DCreate(this.k, this.dim, this.dim); for (int j = 0; j "lt" this.k; ++j) // each component { this.covars[j] = LocalMatIdentity(this.dim); } // instantiate this.probs this.probs = LocalMatCreate(this.N, this.k); for (int i = 0; i "lt" this.N; ++i) for (int j = 0; j "lt" this.k; ++j) this.probs[i][j] = 1.0 / this.k; // use EM meta-algorithm to compute this.probs int iter; for (iter = 0; iter "lt" this.maxIter; ++iter) { //if (verbose == true) // Console.WriteLine("\niter = " + iter); double oldMeanLogLike = MeanLogLikelihood(); this.ExpectStep(X); // update the probs double newMeanLogLike = MeanLogLikelihood(); // auto-stop if (iter "gt" this.maxIter / 2 "and" Math.Abs(newMeanLogLike - oldMeanLogLike) "lt" 1.0e-6) { break; } this.MaximStep(X); // coefs, means, covars } return iter; } // Cluster // ------------------------------------------------------ public double[][] PredictProbs(double[][] X) { double[][] result = LocalMatCreate(this.N, this.k); for (int j = 0; j "lt" this.k; ++j) { for (int i = 0; i "lt" this.N; ++i) { result[i][j] = this.coefs[j] * LocalMatGaussianPdf(X[i], this.means[j], this.covars[j]); // this is the hard part } } // make each row sum to 1 for (int i = 0; i "lt" this.N; ++i) { double rowSum = 0.0; for (int j = 0; j "lt" this.k; ++j) rowSum += result[i][j]; for (int j = 0; j "lt" this.k; ++j) result[i][j] /= rowSum; } return result; } // PredictProbs // ------------------------------------------------------ public int[] PredictLabels(double[][] X) { // mimics the scikit GaussianMixture.predict() int[] result = new int[this.N]; double[][] probs = this.PredictProbs(X); // Nxk for (int i = 0; i "lt" this.N; ++i) { for (int j = 0; j "lt" this.k; ++j) { double[] p = probs[i]; int maxIdx = ArgMax(p); result[i] = maxIdx; } } return result; } // ------------------------------------------------------ private static int ArgMax(double[] vec) { // helper for PredictLabels() int n = vec.Length; double maxVal = vec[0]; int maxIdx = 0; for (int i = 0; i "lt" n; ++i) { if (vec[i] "gt" maxVal) { maxVal = vec[i]; maxIdx = i; } } return maxIdx; } // ------------------------------------------------------ private void ExpectStep(double[][] X) { // use new means and covariance matrices update probs this.probs = this.PredictProbs(X); } // ------------------------------------------------------ private void MaximStep(double[][] X) { // use new probs update means and covariance matrices // 0. compute new prob col sums needed to update double[] probSums = new double[this.k]; for (int i = 0; i "lt" this.N; ++i) for (int j = 0; j "lt" this.k; ++j) probSums[j] += this.probs[i][j]; // 1. update mixture coefficients directly for (int j = 0; j "lt" this.k; ++j) this.coefs[j] = probSums[j] / this.N; // 2. update means indiectly then copy into this.means // uj = S(i,n)[pij * xi] / S(i,n)[pij] double[][] newMeans = LocalMatCreate(this.k, this.dim); for (int j = 0; j "lt" this.k; ++j) for (int i = 0; i "lt" this.N; ++i) for (int d = 0; d "lt" this.dim; ++d) newMeans[j][d] += X[i][d] * this.probs[i][j]; for (int j = 0; j "lt" this.k; ++j) for (int d = 0; d "lt" this.dim; ++d) newMeans[j][d] /= probSums[j]; for (int row = 0; row "lt" this.k; ++row) // copy for (int col = 0; col "lt" this.dim; ++col) this.means[row][col] = newMeans[row][col]; // 3. update covariances indirectly then copy // Cj = S(i,n)[pij * (xi-uj) * (xi-uj)T] / S(i,n)[pij] double[][][] newCovars = LocalMat3DCreate(this.k, this.dim, this.dim); for (int j = 0; j "lt" this.k; ++j) { double[] u = this.means[j]; // mean for component for (int i = 0; i "lt" this.N; ++i) { double[] x = X[i]; double scale = this.probs[i][j]; double[][] tmp = LocalVecVecScale(x, u, scale); // accumulate for (int row = 0; row "lt" this.dim; ++row) for (int col = 0; col "lt" this.dim; ++col) newCovars[j][row][col] += tmp[row][col]; } // i // divide curr covar by prob_sums for (int row = 0; row "lt" this.dim; ++row) for (int col = 0; col "lt" this.dim; ++col) newCovars[j][row][col] /= probSums[j]; // condition the diagonal for (int row = 0; row "lt" this.dim; ++row) newCovars[j][row][row] += 1.0e-6; } // j // copy indirect result into this.covars for (int j = 0; j "lt" this.k; ++j) for (int row = 0; row "lt" this.dim; ++row) for (int col = 0; col "lt" this.dim; ++col) this.covars[j][row][col] = newCovars[j][row][col]; } // MaximStep // ------------------------------------------------------ public double MeanLogLikelihood() { // used for auto-stopping double sum = 0.0; // for all rows for (int i = 0; i "lt" this.N; ++i) { double rowSum = 0.0; for (int j = 0; j "lt" this.k; ++j) { if (this.probs[i][j] "gt" 1.0e-6) { rowSum += Math.Log(this.probs[i][j]); } } sum += rowSum; } return sum / this.N; } // MeanLogLikelihood // ------------------------------------------------------ // // lots of local helper functions needed // // ------------------------------------------------------ static double[][] LocalVecVecScale(double[] x, double[] u, double scale) { // helper for maximization step // (x-u) * (x-u)T * scale int n = u.Length; double[] x_minus_u = new double[n]; for (int i = 0; i "lt" n; ++i) x_minus_u[i] = x[i] - u[i]; double[][] result = LocalMatCreate(n, n); for (int i = 0; i "lt" n; ++i) for (int j = 0; j "lt" n; ++j) result[i][j] = x_minus_u[i] * x_minus_u[j] * scale; return result; } // ------------------------------------------------------ private int[] Select(int N, int n) { // select n distinct indices from [0,N-1] inclusive int[] indices = new int[N]; for (int i = 0; i "lt" N; ++i) indices[i] = i; this.Shuffle(indices); int[] result = new int[n]; for (int i = 0; i "lt" n; ++i) result[i] = indices[i]; return result; } // ------------------------------------------------------ private void Shuffle(int[] indices) { // helper for Select() // Fisher-Yates shuffle int n = indices.Length; for (int i = 0; i "lt" n; ++i) { int j = this.rnd.Next(i, n); int tmp = indices[i]; indices[i] = indices[j]; indices[j] = tmp; } } // ------------------------------------------------------ static double LocalMatGaussianPdf(double[] x, double[] u, double[][] covar) { // multivariate Gaussian distribution PDF // x and u must be convert to column vectors/matrices double[][] X = LocalVecToMat(x, x.Length, 1); double[][] U = LocalVecToMat(u, u.Length, 1); int k = x.Length; // dimension double[][] a = LocalMatTranspose(LocalMatDiff(X, U)); double[][] L = LocalMatCholesky(covar); double[][] b = LocalMatInverseFromCholesky(L); double[][] c = LocalMatDiff(X, U); double[][] d = LocalMatProduct(a, b); double[][] e = LocalMatProduct(d, c); double numer = Math.Exp(-0.5 * e[0][0]); double f = Math.Pow(2 * Math.PI, k); double g = LocalMatDeterminantFromCholesky(L); double denom = Math.Sqrt(f * g); double pdf = numer / denom; return pdf; } // ------------------------------------------------------ static double[][] LocalVecToMat(double[] vec, int nRows, int nCols) { // helper for LocalMatGaussianPdf() double[][] result = LocalMatCreate(nRows, nCols); int k = 0; for (int i = 0; i "lt" nRows; ++i) for (int j = 0; j "lt" nCols; ++j) result[i][j] = vec[k++]; return result; } // ------------------------------------------------------ static double[][] LocalMatTranspose(double[][] m) { // helper for LocalMatGaussianPdf() int nr = m.Length; int nc = m[0].Length; double[][] result = LocalMatCreate(nc, nr); // note for (int i = 0; i "lt" nr; ++i) for (int j = 0; j "lt" nc; ++j) result[j][i] = m[i][j]; return result; } // ------------------------------------------------------ static double[][] LocalMatDiff(double[][] A, double[][] B) { // helper for LocalMatGaussianPdf() int rows = A.Length; int cols = A[0].Length; double[][] result = LocalMatCreate(rows, cols); for (int i = 0; i "lt" rows; ++i) for (int j = 0; j "lt" cols; ++j) result[i][j] = A[i][j] - B[i][j]; return result; } // ------------------------------------------------------ static double[][] LocalMatProduct(double[][] matA, double[][] matB) { // helper for LocalMatGaussianPdf() int aRows = matA.Length; int aCols = matA[0].Length; int bRows = matB.Length; int bCols = matB[0].Length; if (aCols != bRows) throw new Exception("Non-conformable matrices"); double[][] result = LocalMatCreate(aRows, bCols); for (int i = 0; i "lt" aRows; ++i) // each row of A for (int j = 0; j "lt" bCols; ++j) // each col of B for (int k = 0; k "lt" aCols; ++k) result[i][j] += matA[i][k] * matB[k][j]; return result; } // ------------------------------------------------------ static double[][] LocalMatCholesky(double[][] m) { // helper for LocalMatGaussianPdf() // Cholesky decomposition // m is square, symmetric, positive definite // typically a covariance matrix int n = m.Length; double[][] result = LocalMatCreate(n, n); for (int i = 0; i "lt" n; ++i) { for (int j = 0; j "lte" i; ++j) { double sum = 0.0; for (int k = 0; k "lt" j; ++k) sum += result[i][k] * result[j][k]; if (i == j) { double tmp = m[i][i] - sum; if (tmp "lt" 0.0) throw new Exception("MatCholesky fatal error "); result[i][j] = Math.Sqrt(tmp); } else { if (result[j][j] == 0.0) throw new Exception("MatCholesky fatal error "); result[i][j] = (1.0 / result[j][j] * (m[i][j] - sum)); } } // j } // i return result; } // LocalMatCholesky // ------------------------------------------------------ static double[][] LocalMatInverseFromCholesky(double[][] L) { // helper for LocalMatGaussianPdf() // L is a lower triangular result of Cholesky decomp // direct version int n = L.Length; double[][] result = LocalMatIdentity(n); for (int k = 0; k "lt" n; ++k) { for (int j = 0; j "lt" n; j++) { for (int i = 0; i "lt" k; i++) { result[k][j] -= result[i][j] * L[k][i]; } result[k][j] /= L[k][k]; } } for (int k = n - 1; k "gte" 0; --k) { for (int j = 0; j "lt" n; j++) { for (int i = k + 1; i "lt" n; i++) { result[k][j] -= result[i][j] * L[i][k]; } result[k][j] /= L[k][k]; } } return result; } // LocalMatInverseFromCholesky // ------------------------------------------------------ static double LocalMatDeterminantFromCholesky(double[][] L) { // helper for LocalMatGaussianPdf() // product of squared diag elements of L double result = 1.0; int n = L.Length; for (int i = 0; i "lt" n; ++i) result *= L[i][i] * L[i][i]; return result; } // -------------------------------------------------------- static double[][] LocalMatIdentity(int n) { // used by LocalMatInverseFromCholesky and covers init double[][] result = LocalMatCreate(n, n); for (int i = 0; i "lt" n; ++i) result[i][i] = 1.0; return result; } // ------------------------------------------------------ static double[][] LocalMatCreate(int rows, int cols) { // used by many of the helper functions double[][] result = new double[rows][]; for (int i = 0; i "lt" rows; ++i) result[i] = new double[cols]; return result; } // ------------------------------------------------------ static double[][][] LocalMat3DCreate(int factors, int rows, int cols) { // used to init this.covars double[][][] result = new double[factors][][]; for (int f = 0; f "lt" factors; ++f) { result[f] = LocalMatCreate(rows, cols); } return result; } // ------------------------------------------------------ static void LocalMatShow(double[][] m, int dec, int wid) { // used by verbose == true methods for (int i = 0; i "lt" m.Length; ++i) { for (int j = 0; j "lt" m[0].Length; ++j) { double v = m[i][j]; if (Math.Abs(v) "lt" 1.0e-5) v = 0.0; // avoid "-0" Console.Write(v.ToString("F" + dec).PadLeft(wid)); } Console.WriteLine(""); } } // ------------------------------------------------------ static void LocalVecShow(double[] vec, int dec, int wid) { // used by verbose == true methods for (int i = 0; i "lt" vec.Length; ++i) { double x = vec[i]; if (Math.Abs(x) "lt" 1.0e-5) x = 0.0; // avoid "-0" Console.Write(x.ToString("F" + dec).PadLeft(wid)); } Console.WriteLine(""); } // ------------------------------------------------------ } // class GMM // ======================================================== } // ns
The dummy dataset:
# dummy_10.txt 0.01, 0.10 0.02, 0.09 0.03, 0.10 0.04, 0.06 0.05, 0.06 0.06, 0.05 0.07, 0.05 0.08, 0.01 0.09, 0.02 0.10, 0.01
Here’s the scikit version:
# gmm_scikit.py # Gaussian mixture model clustering import numpy as np from sklearn.mixture import GaussianMixture # ----------------------------------------------------------- def main(): print("\nBegin GMM using scikit GaussianMixture ") np.random.seed(0) np.set_printoptions(sign=" ", precision=4, floatmode="fixed", suppress=True) print("\nLoading data ") data_file = ".\\Data\\dummy_10.txt" X = np.loadtxt(data_file, usecols = (0,1), delimiter = ",", comments="#", dtype=np.float64) print("Done ") print("\ndata: ") print(X) print("\nCreating scikit GMM model k=3 ") gmm = GaussianMixture(n_components=3, random_state=0, init_params='random') print("\nClustering ") gmm.fit(X) print("Done ") probs = gmm.predict_proba(X) labels = gmm.predict(X) n = gmm.n_iter_ print("number iterations = " + str(n)) print("\nFinal scikit clustering: ") for i in range(len(probs)): print(X[i], end="") print(" | ", end="") print(probs[i], end = "") print(" | ", end="") print(labels[i]) print("\nmeans: ") print(gmm.means_) print("\ncovariances: ") print(gmm.covariances_) print("\ncoefficients: ") print(gmm.weights_) print("\nEnd demo ") if __name__ == "__main__": main()
You must be logged in to post a comment.