Matrix Decomposition

In ordinary arithmetic, the decomposition of a number is finding two numbers that when multiplied together give the original number. For example, the number 12 can be decomposed into 3 and 4 because 3 * 4 = 12.

There’s a similar idea with matrices. The goal is to start with some matrix M then find two matrices A and B so that A * B = M where the * means matrix multiplication.

Before I go any further I’ll point out that matrix decomposition is important because the technique is used to find the inverse of a matrix, and the matrix inverse is one of the most common and important techniques in all of mathematics.

There are several algorithms you can use to decompose a matrix. I usually use Croat’s algorithm.

I coded up a demo using both C# and JavaScript. In my demo, I start with a matrix M =

 3.0  7.0   2.0   5.0
 1.0  8.0   4.0   2.0
 2.0  1.0   9.0   3.0
 5.0  4.0   7.0   1.0

Weirdly, the algorithm returns the two A and B decompositions combined into one matrix called the LUM (lower-upper matrix):

 5.0000   4.0000   7.0000   1.0000
 0.2000   7.2000   2.6000   1.8000
 0.4000  -0.0833   6.4167   2.7500
 0.6000   0.6389  -0.6017   4.9048

Extracting the lower and upper matrices gives Lower =

 1.0000   0.0000   0.0000   0.0000
 0.2000   1.0000   0.0000   0.0000
 0.4000  -0.0833   1.0000   0.0000
 0.6000   0.6389  -0.6017   1.0000

and Upper =

 5.0000   4.0000   7.0000   1.0000
 0.0000   7.2000   2.6000   1.8000
 0.0000   0.0000   6.4167   2.7500
 0.0000   0.0000   0.0000   4.9048

If you look at LUM and Lower and Upper you can see that Lower has the lower left corner of the LUM with dummy 1.0 values on the diagonal and 0.0 values in the upper right. The Upper has the upper right corner of the LUM, including the diagonal, and 0.0 values in the lower left.

OK, now if you multiply Lower times upper you get (drum roll please):

 5.0   4.0   7.0   1.0
 1.0   8.0   4.0   2.0
 2.0   1.0   9.0   3.0
 3.0   7.0   2.0   5.0

Ooops. Lower * Upper isn’t quite the original matrix M. Some of the rows have been rearranged. Well, as it turn out, the decomposition, in addition to returning the combined LUM matrix, also returns a perm (permutation) vector that tells you how to rearrange the rows of Lower * Upper in order to get the original matrix M back. For my example, the perm vector is:

 3   1   2   0

This means row [3] of Lower * Upper goes to row [0], and so on. I wrote a helper function called applyPerm(m, perm) to programmatically switch the rows.

Well, there’s actually quite a few more details about matrix decomposition, but if you found this blog post looking for a simple explanation, you should have a pretty good grasp of the main ideas.

Artist Thomas Schaller specializes architecture-related scenes, using watercolor. A very difficult technique to get right.

Posted in Miscellaneous | Leave a comment

Weighted k-NN Classification Using Python

I wrote an article titled, “Weighted k-NN Classification Using Python” in the April 2019 issue of Visual Studio Magazine. See

The weighted k-nearest neighbors (k-NN) classification algorithm is a relatively simple technique to predict the class of an item based on two or more numeric predictor variables. For example, you might want to predict the political party affiliation (democrat, republican, independent) of a person based on their age, annual income, gender, years of education and so on. In the article I explain how to implement the weighted k-nearest neighbors algorithm using Python.

I present a demo program that uses dummy data with two numeric input values and three classes (0 = red, 1 = blue, 2 = green) to predict. The k-NN algorithm is very simple. First you select a value for k (like k = 4). Then, for the item to predict the class of, you examine your training data and find the 4 (or whatever k is) nearsest items. Suppose those four nearest items have classes red, blue, red, red. Your prediction is “red”.

But there are a lot of details. How do you measure “closest”? How do you give more weight to training items that are very close to the item-to-predict? How do you deal with predictor variables that aren’t numeric?

The weighted k-NN classification algorithm has received increased attention recently for two reasons. First, by using neural autoencoding, k-NN can deal with mixed numeric and non-numeric predictor values. Second, compared to many other classification algorithms, notably neural networks, the results of weighted k-NN are relatively easy to interpret. Interpretability has become increasingly important due to legal requirements of machine learning techniques from laws such as the European Union’s GDPR.

Posted in Machine Learning | Leave a comment

Neural Network Momentum using JavaScript

I’ve been slowly but surely implementing core neural network functionality using the JavaScript language. My most recent exploration was modifying a back-propagation function so that it used momentum. The motivation is to speed up training. The idea is best explained using code. Here’s a snippet of the training code:

// update input-to-hidden weights
for (let i = 0; i <; ++i) {
  for (let j = 0; j < this.nh; ++j) {
    let delta = -1.0 * lrnRate * ihGrads[i][j];
    this.ihWeights[i][j] += delta;
    this.ihWeights[i][j] += momRate * ihPrevWtsDelta[i][j];
    ihPrevWtsDelta[i][j] = delta;  // save delta for next time

This code updates ihWeights[i][j] which are the weights that connect input node [i] to hidden node [j]. Each weight is incremented by a quantity delta which depends on the gradient for the i-j weight times a small value such as 0.01 which is the learning rate. This adding delta update decreases the error of the network. (The gradients are computed earlier, and that’s the tricky part).

Momentum adds a second quantity to each weight. The momentum quantity is the value of the previous delta times a small value like 0.6 which is the momentum rate. The effect of this is that as long as the updates are improving, the amount added each times increases.

Now eventually, each weight value will overshoot an optimal value, and the updates will change signs and reduce the value. But experience has shown that this process speeds up training in the long run.

Suppose you’re blindfolded and are trying to reach a goal that is, unknown to you, 100 feet ahead. You could take one step and check to see if you’re at the goal yet, Then take another step, and so on until you get close enough to the goal. This is like regular back-propagation training.

Or you could take one step, and determine you’re not at the goal yet. Next you take two steps. Then four steps, then eight steps and so on. Eventually (rather quickly) you would overshoot your goal that’s 100 feet ahead, and then you’d turn around and go back until you reach the goal. This is like momentum. Your progress might look something like:

at 0,  take 1 step,   at 1
at 1,  take 2 steps,  at 3
at 3,  take 4 steps,  at 7
at 7,  take 8 steps,  at 15
at 15, take 16 steps, at 31
at 31, take 32 steps, at 63
at 63, take 64 steps, at 127
at 127, go back 32 steps, at 95
at 95, take 4 steps, at 99
close enough, stop.

This is highly simplified and just an analogy of course but should help you understand how momentum speeds up training.

Four famous pinball machines. “Baffle Ball” (1931) – First widely successful machine, no flippers. “Humpty Dumpty” (1947) – First machine with flippers. “Knock Out” (1950) – Very high quality machine for its time. “Creature from the Black Lagoon” (1992) – Considered by fans to be one of the top-50 best modern machines.

Posted in JavaScript, Machine Learning | 2 Comments

I Give a Talk on Binary Classification Using Keras

I recently gave a short workshop/talk at the tech company I work for on binary classification using the Keras neural network code library.

The goal of a binary classification problem is to predict something that can take on one of just two possible values. For example, you might want to predict the sex (male or female) of a person based on their height, annual income, and credit rating. There are many different techniques you can use for binary classification. Older techniques include logistic regression, support vector machine, naive Bayes, k-nearest neighbors, decision trees, and several others.

Using a neural network for binary classification has several advantages over other techniques. But NNs have disadvatages too, most notably they 1.) require lots of training data, and 2.) the resulting model is difficult to interpret (the model that is, not a specific prediction).

It’s possible to write neural network code from scratch using a language like Java, C#, or Python. But writing from scratch is very time consuming. So, the usual approach is to use a neural network code library. Amonmg my colleagues, the most common libraries are Keras, TensorFlow, PyTorch, scikit-learn, and MXNet, but there are dozens more.

Keras operates at a relatively high level of abstraction which means that you can get a system up and running fairly quickly. The disadvantage of Keras is that it’s moe difficult to customize than TensorFlow and PyTorch. Keras uses the Python language which is by far the most common language for ML. And note that Keras actually relies on TensorFlow.

For my talk, I used the Cleveland Heart Disease dataset. The goal is to predict whether a patient will develop heart disease based on 13 predictor variables (age, sex, cholesterol level, etc.)

Even though the program code is quite short, it’s extremely dense, meaning there are many ideas associated with each line of code. The output of the trained neural network is a single value between 0.0 and 1.0 where a value of less than 0.5 means a prediction of no heart disease, and an output value greater than 0.5 means heart disease.

Two images from my “What the Heck” collection. (Click to enlarge image) Left: “Legendary blind chess player Zsiltzova Lubov looks over the game of Anna Stolarczyk” (!) Right: Perhaps the best newspaper headline in history.

Posted in Keras, Machine Learning | 1 Comment

Selecting n Random Items Using JavaScript

Even though I’m not a huge fan of JavaScript, I like the language more than most of my engineering / developer colleagues. Just for fun I decided to code up a function that selects n random integers from N. For example, if N = 7 and n = 3 that means pick three different random integers from 0 to 6 inclusive. One possible result could be (4, 0, 2). I decided to call my function select(N, n).

This is similar to the NumPy choice() function. As usual, even the simplest problem has dozens of different possible designs. Because my select() function needs to use random functionality, I decided to place select() inside a random class, as opposed to having select() as a standalone function that either a.) assumes the existence of a global random object, or b.) has a rndObj parameter, or c.) instantiates a rndObj object in the body. Whew! design decisions are a lot more complex in some sense than implementation.

My implementation is simple. First I generate an array with values from 0 to N-1, for example [0, 1, 2, 3, 4, 5, 6]. Then I use the Fisher-Yates shuffle to randomly scramble the order, for example [4, 0, 2, 6, 1, 5, 3]. Then I mask off the first n values, for example [4, 0, 2].

A different algorithm to select n random items from N is called reservoir sampling. Reservoir sampling is useful when N is very large (say, over 100,000). If you search this blog site for “reservoir” you’ll find several posts on the technique.

Good fun.

Before the rise of CGI technology in movies, amazing special effects were created by using large matte (“masked”) paintings on sheets of glass. From left: “Return of the Jedi” (1983), “Spartacus” (1960), “Raiders of the Lost Ark” (1981), “Earthquake” (1974).

Posted in JavaScript, Miscellaneous | Leave a comment

The Difference Between Cross Entropy and Binary Cross Entropy

The basic ideas of cross entropy error and binary cross entropy error are relatively simple. But they’re often a source of confusion for developers who are new to machine learning because of the many topics related to how the two forms of entropies are used.

First, there’s general cross entropy error, which is used to measure the difference between two sets of two or more values that sum to 1.0. Put more accurately, cross entropy error measures the difference between a correct probability distribution and a predicted distribution.

Suppose correct = (0.2, 0.5, 0.3) and predicted = (0.1, 0.3, 0.6). The basic math cross entropy error is computed as:

ce = -[0.2 * log(0.1) + 0.5 * log(0.3) + 0.3 * log(0.6)]
   = -[0.2 * -2.30    + 0.5 * -1.20    + 0.3 * -0.15]
   = -[0.46 + -0.60 + -0.15]
   = 1.22

In words, “the negative sum of the corrects times the log of the predicteds.”

But in machine learning contexts, the correct probability distribution is usually a vector with one 1 value and the rest 0 values, for example, (0, 1, 0). So if correct = (0, 1, 0) and predicted = (0.1, 0.7, 0.2) then:

ce = -[0 * log(0.1) + 1 * log(0.7) + 0 * log(0.2)]
   = -[0 + (1 * -0.36) + 0]
   = 0.36

Notice that when the correct probability distribution has one 1 and the rest 0s, all the terms in the cross entropy calculation drop out except for one because of the multiply by zero terms. As a minor detail, note that if the predicted probability values that corresponds to the correct distribution 1 value, you’d have a log(0) term which is negative infinity so a code implementation would blow up.

Now, unfortunately, binary cross entropy is a special case for machine learning contexts but not for general mathematics cases. Suppose you have a coin flip with correct = (0.5, 0.5) and predicted = (0.6, 0.4). This is a binary problem because there are only two outcomes. Binary cross entropy can be calculated as above with no problem.

Or suppose you have a different ML problem with correct = (1, 0) and predicted = (0.8, 0.2). Again you can compute binary cross entropy in the usual way without any problem.

But, alas, in machine learning contexts, binary correct and predicted values are almost never encoded as pairs of values. Instead they’re stored as just one value. The idea is that for a binary problem if you know one probability, you automatically know the other. For example, suppose predicted is (0.7, 0.3). All you need to store is the 0.7 value because the other probability must be 1 – 0.7 = 0.3.

So, for a binary classification problem where computed and correct probabilities are stored as single values, the standard definition of cross entropy can’t be applied. Instead, binary cross entropy becomes just, “the log of the single predicted value.”

So, what’s the point? If you’re implementing ML code from scratch, and you use the standard technique of storing just single values for binary problems, you’ve got to have two code paths – one for binary problems and one for problems with three or more values. If you’re implementing ML code using a code library, you’ve got to carefully read the library documentation to see how the library deals with binary cross entropy error.

Four images from “Entropy and Art: The View Beyond Arnheim”, Matilde Marcolli. See I understand machine learning very well. I don’t understand art very well, and I don’t think anyone can.

Posted in Machine Learning | Leave a comment

Applying a Permutation to an Array

A permutation of order n is an arrangement of the numbers 0 through n-1. For example, one permutation of order 5 is (4, 2, 0, 1, 3). Suppose you have an array of (“C”, “E”, “A”, “B”, “D”) and the problem is to apply the permutation to the array. In words, “the value in array cell 4 goes to result cell 0, the value in array cell 2 goes to result cell 1, the value in array cell 0 goes to result cell 2, the value in array cell 1 goes to result cell 3, the value in array cell 3 goes to result cell 4”. The result would be (“D”, “A”, “C”, “E”, “B”).

The best way to implement this as a function is very simple. For example, in Python:

def apply_perm_simple(arr, perm):
  n = len(arr)
  result = np.empty(n, dtype=object)
  for i in range(n):
    result[i] = arr[perm[i]]
  return result

End of story. Almost.

Notice the code above creates a result array and then copies into that array. An old interview question asks how to apply a permutation to an array in place, without using a result array. It’s possible, for example:

def apply_perm_clever(arr, perm):
  n = len(arr)
  for i in range(n):
    j = perm[i]
    while j < i:
      j = perm[j]
    swap(arr, i, j)
  return arr

def swap(arr, i, j):
  tmp = arr[i]
  arr[i] = arr[j]
  arr[j] = tmp

The algorithm is quite subtle and the only way to really understand it (for me at least) is to walk through the code line by line and examine the values of i, j, and the source array.

There is a slightly different version of this algorithm that reduces the number iterations in the while-loop at the expense of having to track changes by modifying the perm parameter.

I coded a demo where I set an array = (“cow”, “dog”, “bat”, “ant”, “gnu”, “elk”, “fox”). I used the built-in NumPy argsort() function to get a permutation of (3, 2, 0, 1, 5, 6, 4) which when applied to the array will give an array that’s sorted in alphabetical order which makes it easy to tell if the demo code is working correctly or not.

This is an interesting little problem. Let me emphasize that the “clever” approach would never be used in practice — it’s just way too complicated.

Three images from an Internet search for “too complicated”. Left: fashion. Center: music. Right: movie plot (“Predestination”, 2014). Interesting.

Posted in Machine Learning | Leave a comment