There’s a similar idea with matrices. The goal is to start with some matrix M and then find two matrices A and B so that A * B = M, where * means matrix multiplication.

Before I go any further, I’ll point out that matrix decomposition is important because the technique is used to find the inverse of a matrix, and the matrix inverse is one of the most common and important operations in all of mathematics.

There are several algorithms you can use to decompose a matrix. I usually use Crout’s algorithm.

I coded up a demo using both C# and JavaScript. In my demo, I start with a matrix M =

3.0  7.0  2.0  5.0
1.0  8.0  4.0  2.0
2.0  1.0  9.0  3.0
5.0  4.0  7.0  1.0

Weirdly, the algorithm returns the two factors A and B combined into a single matrix called the LUM (lower-upper matrix):

 5.0000   4.0000   7.0000   1.0000
 0.2000   7.2000   2.6000   1.8000
 0.4000  -0.0833   6.4167   2.7500
 0.6000   0.6389  -0.6017   4.9048

Extracting the lower and upper matrices gives Lower =

 1.0000   0.0000   0.0000   0.0000
 0.2000   1.0000   0.0000   0.0000
 0.4000  -0.0833   1.0000   0.0000
 0.6000   0.6389  -0.6017   1.0000

and Upper =

 5.0000   4.0000   7.0000   1.0000
 0.0000   7.2000   2.6000   1.8000
 0.0000   0.0000   6.4167   2.7500
 0.0000   0.0000   0.0000   4.9048

If you look at LUM, Lower, and Upper, you can see that Lower takes the lower left portion of the LUM, with dummy 1.0 values on the diagonal and 0.0 values in the upper right. Upper takes the upper right portion of the LUM, including the diagonal, with 0.0 values in the lower left.
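In case it's useful, here's a minimal sketch of that extraction step using NumPy (my demo is in C# and JavaScript, so this is just the idea; the name lum is assumed to hold the combined matrix shown above):

import numpy as np

def extract_lower_upper(lum):
    n = len(lum)
    lower = np.tril(lum, k=-1) + np.eye(n)   # strictly lower part plus dummy 1.0 diagonal
    upper = np.triu(lum, k=0)                # upper part, including the diagonal
    return lower, upper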

OK, now if you multiply Lower times Upper you get (drum roll please):

5.0  4.0  7.0  1.0
1.0  8.0  4.0  2.0
2.0  1.0  9.0  3.0
3.0  7.0  2.0  5.0

Oops. Lower * Upper isn’t quite the original matrix M. Some of the rows have been rearranged. Well, as it turns out, the decomposition, in addition to returning the combined LUM matrix, also returns a perm (permutation) vector that tells you how to rearrange the rows of Lower * Upper in order to get the original matrix M back. For my example, the perm vector is:

3 1 2 0

This means row [3] of Lower * Upper goes to row [0], and so on. I wrote a helper function called applyPerm(m, perm) to programmatically switch the rows.

Well, there are actually quite a few more details about matrix decomposition, but if you found this blog post looking for a simple explanation, you should have a pretty good grasp of the main ideas.

*Artist Thomas Schaller specializes in architecture-related scenes, using watercolor, a very difficult technique to get right.*

The weighted k-nearest neighbors (k-NN) classification algorithm is a relatively simple technique to predict the class of an item based on two or more numeric predictor variables. For example, you might want to predict the political party affiliation (democrat, republican, independent) of a person based on their age, annual income, gender, years of education and so on. In the article I explain how to implement the weighted k-nearest neighbors algorithm using Python.

I present a demo program that uses dummy data with two numeric input values and three classes (0 = red, 1 = blue, 2 = green) to predict. The k-NN algorithm is very simple. First you select a value for k (like k = 4). Then, for the item whose class you want to predict, you examine your training data and find the 4 (or whatever k is) nearest items. Suppose those four nearest items have classes red, blue, red, red. Your prediction is “red”.

But there are a lot of details. How do you measure “closest”? How do you give more weight to training items that are very close to the item-to-predict? How do you deal with predictor variables that aren’t numeric?
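To give a rough idea of the weighting, here's a minimal sketch that assumes Euclidean distance and inverse-distance weights (this is not my exact demo code; the function name and parameters are just for illustration):

import numpy as np

def weighted_knn_predict(train_x, train_y, x, k, num_classes):
    # Euclidean distances from item x to every training item
    dists = np.sqrt(np.sum((train_x - x) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]           # indices of the k closest training items
    votes = np.zeros(num_classes)
    for idx in nearest:
        w = 1.0 / (dists[idx] + 1.0e-6)       # closer items get a larger weight
        votes[train_y[idx]] += w
    return np.argmax(votes)                   # class with the largest weighted vote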

The weighted k-NN classification algorithm has received increased attention recently for two reasons. First, by using neural autoencoding, k-NN can deal with mixed numeric and non-numeric predictor values. Second, compared to many other classification algorithms, notably neural networks, the results of weighted k-NN are relatively easy to interpret. Interpretability has become increasingly important due to legal requirements placed on machine learning techniques by laws such as the European Union’s GDPR.


// update input-to-hidden weights
for (let i = 0; i < this.ni; ++i) {
  for (let j = 0; j < this.nh; ++j) {
    let delta = -1.0 * lrnRate * ihGrads[i][j];
    this.ihWeights[i][j] += delta;
    this.ihWeights[i][j] += momRate * ihPrevWtsDelta[i][j];
    ihPrevWtsDelta[i][j] = delta;  // save delta for next time
  }
}

This code updates ihWeights[i][j], which are the weights that connect input node [i] to hidden node [j]. Each weight is incremented by a quantity delta that depends on the gradient for the i-j weight times a small learning rate value such as 0.01. Adding delta decreases the error of the network. (The gradients are computed earlier, and that’s the tricky part.)

Momentum adds a second quantity to each weight. The momentum quantity is the value of the previous delta times a small value like 0.6, which is the momentum rate. The effect of this is that as long as the updates are improving, the amount added each time increases.

Now eventually, each weight value will overshoot an optimal value, and the updates will change signs and reduce the value. But experience has shown that this process speeds up training in the long run.

Suppose you’re blindfolded and are trying to reach a goal that is, unknown to you, 100 feet ahead. You could take one step and check to see if you’re at the goal yet, then take another step, and so on until you get close enough to the goal. This is like regular back-propagation training.

Or you could take one step, and determine you’re not at the goal yet. Next you take two steps. Then four steps, then eight steps and so on. Eventually (rather quickly) you would overshoot your goal that’s 100 feet ahead, and then you’d turn around and go back until you reach the goal. This is like momentum. Your progress might look something like:

at 0, take 1 step, at 1
at 1, take 2 steps, at 3
at 3, take 4 steps, at 7
at 7, take 8 steps, at 15
at 15, take 16 steps, at 31
at 31, take 32 steps, at 63
at 63, take 64 steps, at 127
at 127, go back 32 steps, at 95
at 95, take 4 steps, at 99
close enough, stop.

This is highly simplified and just an analogy of course but should help you understand how momentum speeds up training.

*Four famous pinball machines. “Baffle Ball” (1931) – First widely successful machine, no flippers. “Humpty Dumpty” (1947) – First machine with flippers. “Knock Out” (1950) – Very high quality machine for its time. “Creature from the Black Lagoon” (1992) – Considered by fans to be one of the top-50 best modern machines.*

The goal of a binary classification problem is to predict something that can take on one of just two possible values. For example, you might want to predict the sex (male or female) of a person based on their height, annual income, and credit rating. There are many different techniques you can use for binary classification. Older techniques include logistic regression, support vector machine, naive Bayes, k-nearest neighbors, decision trees, and several others.

Using a neural network for binary classification has several advantages over other techniques. But NNs have disadvantages too, most notably: 1.) they require lots of training data, and 2.) the resulting model is difficult to interpret (the model that is, not a specific prediction).

It’s possible to write neural network code from scratch using a language like Java, C#, or Python. But writing from scratch is very time consuming. So, the usual approach is to use a neural network code library. Among my colleagues, the most common libraries are Keras, TensorFlow, PyTorch, scikit-learn, and MXNet, but there are dozens more.

Keras operates at a relatively high level of abstraction, which means that you can get a system up and running fairly quickly. The disadvantage of Keras is that it’s more difficult to customize than TensorFlow and PyTorch. Keras uses the Python language, which is by far the most common language for ML. And note that Keras actually relies on TensorFlow.

For my talk, I used the Cleveland Heart Disease dataset. The goal is to predict whether a patient will develop heart disease based on 13 predictor variables (age, sex, cholesterol level, etc.)

Even though the program code is quite short, it’s extremely dense, meaning there are many ideas associated with each line of code. The output of the trained neural network is a single value between 0.0 and 1.0 where a value of less than 0.5 means a prediction of no heart disease, and an output value greater than 0.5 means heart disease.
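To give a feel for what the code looks like, here's a minimal sketch of the kind of Keras network I described (this is not my exact demo program; it assumes the 13 predictor values and 0/1 labels have already been loaded into NumPy arrays with the hypothetical names train_x, train_y, and unknown_x, where unknown_x is a 1-by-13 array for the patient to predict):

import numpy as np
from tensorflow import keras

model = keras.models.Sequential()
model.add(keras.layers.Dense(units=8, input_dim=13, activation='relu'))   # hidden layer
model.add(keras.layers.Dense(units=1, activation='sigmoid'))              # output in [0.0, 1.0]
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_x, train_y, batch_size=16, epochs=100, verbose=0)

p = model.predict(unknown_x)   # p < 0.5 means no heart disease, p > 0.5 means heart disease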

*Two images from my “What the Heck” collection. (Click to enlarge image) Left: “Legendary blind chess player Zsiltzova Lubov looks over the game of Anna Stolarczyk” (!) Right: Perhaps the best newspaper headline in history.*

This is similar to the NumPy choice() function. As usual, even the simplest problem has dozens of different possible designs. Because my select() function needs to use random functionality, I decided to place select() inside a random class, as opposed to having select() as a standalone function that either a.) assumes the existence of a global random object, or b.) has a rndObj parameter, or c.) instantiates a rndObj object in the body. Whew! Design decisions are, in some sense, a lot more complex than implementation.

My implementation is simple. First I generate an array with values from 0 to N-1, for example [0, 1, 2, 3, 4, 5, 6]. Then I use the Fisher-Yates shuffle to randomly scramble the order, for example [4, 0, 2, 6, 1, 5, 3]. Then I mask off the first n values, for example [4, 0, 2].
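Here's a minimal sketch of that idea written as a standalone function (in my demo, select() actually lives inside a program-defined random class; here, for illustration, I just pass in a NumPy RandomState object named rnd):

import numpy as np

def select(n, N, rnd):
    indices = np.arange(N)            # [0, 1, 2, .. N-1]
    for i in range(N - 1):            # Fisher-Yates shuffle
        r = rnd.randint(i, N)         # random int in [i, N)
        indices[r], indices[i] = indices[i], indices[r]
    return indices[0:n]               # keep the first n values

rnd = np.random.RandomState(1)
print(select(3, 7, rnd))              # 3 random items from 0..6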

A different algorithm to select n random items from N is called reservoir sampling. Reservoir sampling is useful when N is very large (say, over 100,000). If you search this blog site for “reservoir” you’ll find several posts on the technique.
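For completeness, here's a minimal sketch of basic reservoir sampling ("Algorithm R"); the function name and the use of the standard Python random module are my choices for illustration:

import random

def reservoir_sample(stream, n):
    result = []
    for i, item in enumerate(stream):
        if i < n:
            result.append(item)            # fill the reservoir with the first n items
        else:
            j = random.randrange(i + 1)    # random index in [0, i]
            if j < n:
                result[j] = item           # replace an existing reservoir item
    return result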

Good fun.

*Before the rise of CGI technology in movies, amazing special effects were created by using large matte (“masked”) paintings on sheets of glass. From left: “Return of the Jedi” (1983), “Spartacus” (1960), “Raiders of the Lost Ark” (1981), “Earthquake” (1974).*

First, there’s general cross entropy error, which is used to measure the difference between two sets of two or more values that sum to 1.0. Put more accurately, cross entropy error measures the difference between a correct probability distribution and a predicted distribution.

Suppose correct = (0.2, 0.5, 0.3) and predicted = (0.1, 0.3, 0.6). The basic math cross entropy error is computed as:

ce = -[0.2 * log(0.1) + 0.5 * log(0.3) + 0.3 * log(0.6)]
   = -[0.2 * -2.30 + 0.5 * -1.20 + 0.3 * -0.51]
   = -[-0.46 + -0.60 + -0.15]
   = 1.22

(Here log is the natural logarithm.)

In words, “the negative sum of the corrects times the log of the predicteds.”

But in machine learning contexts, the correct probability distribution is usually a vector with one 1 value and the rest 0 values, for example, (0, 1, 0). So if correct = (0, 1, 0) and predicted = (0.1, 0.7, 0.2) then:

ce = -[0 * log(0.1) + 1 * log(0.7) + 0 * log(0.2)]
   = -[0 + (1 * -0.36) + 0]
   = 0.36

Notice that when the correct probability distribution has one 1 and the rest 0s, all the terms in the cross entropy calculation drop out except for one, because of the multiply-by-zero terms. As a minor detail, note that if the predicted probability value that corresponds to the 1 value in the correct distribution is 0, you’d have a log(0) term, which is negative infinity, so a code implementation would blow up.
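Here's a minimal Python sketch of the computation (the function name and the small epsilon guard against log(0) are my additions):

import numpy as np

def cross_entropy(correct, predicted, eps=1.0e-12):
    predicted = np.clip(predicted, eps, 1.0)        # guard against log(0)
    return -np.sum(correct * np.log(predicted))     # -sum of corrects times log of predicteds

print(cross_entropy(np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.3, 0.6])))  # approx 1.22
print(cross_entropy(np.array([0.0, 1.0, 0.0]), np.array([0.1, 0.7, 0.2])))  # approx 0.36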

Now, unfortunately, binary cross entropy is a special case in machine learning contexts, but not in general mathematics. Suppose you have a coin flip with correct = (0.5, 0.5) and predicted = (0.6, 0.4). This is a binary problem because there are only two outcomes. Binary cross entropy can be calculated as above with no problem.

Or suppose you have a different ML problem with correct = (1, 0) and predicted = (0.8, 0.2). Again you can compute binary cross entropy in the usual way without any problem.

But, alas, in machine learning contexts, binary correct and predicted values are almost never encoded as pairs of values. Instead they’re stored as just one value. The idea is that for a binary problem if you know one probability, you automatically know the other. For example, suppose predicted is (0.7, 0.3). All you need to store is the 0.7 value because the other probability must be 1 – 0.7 = 0.3.

So, for a binary classification problem where computed and correct probabilities are stored as single values, the standard definition of cross entropy can’t be applied directly. Instead, binary cross entropy becomes -[y * log(p) + (1 - y) * log(1 - p)], where y is the single stored correct value (0 or 1) and p is the single stored predicted value. When y = 1, this is just the negative log of the single predicted value.
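A minimal sketch of the single-value form (the function name and epsilon guard are mine):

import numpy as np

def binary_cross_entropy(y, p, eps=1.0e-12):
    # y is the single correct value (0 or 1), p is the single predicted value
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

print(binary_cross_entropy(1, 0.8))   # same as general CE on (1, 0) vs (0.8, 0.2), approx 0.22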

So, what’s the point? If you’re implementing ML code from scratch, and you use the standard technique of storing just single values for binary problems, you’ve got to have two code paths – one for binary problems and one for problems with three or more values. If you’re implementing ML code using a code library, you’ve got to carefully read the library documentation to see how the library deals with binary cross entropy error.

*Four images from “Entropy and Art: The View Beyond Arnheim”, Matilde Marcolli. See http://www.its.caltech.edu/~matilde/EntropyArtChapter.pdf. I understand machine learning very well. I don’t understand art very well, and I don’t think anyone can.*

The best way to implement this as a function is very simple. For example, in Python:

import numpy as np

def apply_perm_simple(arr, perm):
    n = len(arr)
    result = np.empty(n, dtype=object)
    for i in range(n):
        result[i] = arr[perm[i]]
    return result

End of story. Almost.

Notice the code above creates a result array and then copies into that array. An old interview question asks how to apply a permutation to an array in place, without using a result array. It’s possible, for example:

def apply_perm_clever(arr, perm):
    n = len(arr)
    for i in range(n):
        j = perm[i]
        while j < i:
            j = perm[j]
        swap(arr, i, j)
    return arr

def swap(arr, i, j):
    tmp = arr[i]
    arr[i] = arr[j]
    arr[j] = tmp

The algorithm is quite subtle and the only way to really understand it (for me at least) is to walk through the code line by line and examine the values of i, j, and the source array.

There is a slightly different version of this algorithm that reduces the number of iterations in the while-loop at the expense of having to track changes by modifying the perm parameter.

I coded a demo where I set an array = (“cow”, “dog”, “bat”, “ant”, “gnu”, “elk”, “fox”). I used the built-in NumPy argsort() function to get a permutation of (3, 2, 0, 1, 5, 6, 4), which, when applied to the array, gives an array that’s sorted in alphabetical order. That makes it easy to tell whether the demo code is working correctly or not.
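Something along these lines (a sketch of the idea, not my exact demo code):

import numpy as np

arr = np.array(["cow", "dog", "bat", "ant", "gnu", "elk", "fox"])
perm = np.argsort(arr)                 # [3 2 0 1 5 6 4]
print(apply_perm_simple(arr, perm))    # ['ant' 'bat' 'cow' 'dog' 'elk' 'fox' 'gnu']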

This is an interesting little problem. Let me emphasize that the “clever” approach would never be used in practice — it’s just way too complicated.

*Three images from an Internet search for “too complicated”. Left: fashion. Center: music. Right: movie plot (“Predestination”, 2014). Interesting.*

I’ve spoken at Interop many times. The conference is interesting to me in part because it has changed so radically over the past several years. The conference used to be all about computer networks and IT-related topics, especially hardware and management software.

But networks and IT evolved very quickly from 2010 to 2015, from requiring a lot of manual work to requiring very little manual work. In fact, the number of IT professionals declined significantly over that time period.

*A couple of photos I took at the 2018 event.*

So, to maintain relevance, the Interop event had to evolve quickly too. Put somewhat differently, every year when I attended Interop, the event seemed different. In particular, over the past couple of years, I’ve seen a quite dramatic increase in topics related to machine learning. Overall though, I’d say that security has been, and continues to be, the most dominant topic.

I wouldn’t be speaking at Interop and I wouldn’t be writing this blog post if I didn’t think the event was good — where good means a valuable use of my time. In addition to good talks from the event speakers, I gain extremely useful information from impromptu chats with other attendees.

Interop is run by UBM, a company that runs business-to-business events for many sectors in addition to technology. This means that Interop doesn’t push one particular company’s products and services. Now to be sure, there are many companies at Interop, and they’re all promoting their services, but that’s a good thing.

As a counter example, one event that I found wasn’t a particularly good use of my time was last year’s VMware conference because it was really one gigantic advertisement. (And had pretty much totally incompetent people running the event).

My talk at Interop will be “Understanding Deep Neural Networks”. I’ll explain exactly what neural networks are, what they can and cannot do, and describe ways that neural networks can be used. If you attend 2019 Interop, be sure to seek me out and say “hello”.

*Mirage – desert. Four images from an Internet search for “desert planet”.*

I used the standard Iris Dataset example, where the goal is to predict species (setosa, versicolor, or virginica) from four numeric predictors (sepal length and width, petal length and width).

I’ve implemented neural networks using several different programming languages (C, C#, Python, R) but I had forgotten how tricky and difficult the code is. But in the end I got the training function to work.

In order to get the train() function to work, I needed to implement several helper functions. I wrote a loadTxt() function to read data into memory. I wrote a program-defined class to generate seedable pseudo-random numbers. I wrote a shuffle() function to generate a random order for processing the training data. I wrote a meanSqErr() function to compute the current mean squared error so I could monitor error during training. And there were a few more minor functions too.
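For example, the mean squared error helper is conceptually just a few lines. Here's a rough Python-style sketch of the idea behind meanSqErr() (the computeOutputs() method name is a hypothetical stand-in for whatever the network's forward pass is called):

def mean_sq_err(net, data_x, data_y):
    sum_err = 0.0
    for i in range(len(data_x)):
        computed = net.computeOutputs(data_x[i])   # hypothetical forward-pass method
        for j in range(len(computed)):
            sum_err += (computed[j] - data_y[i][j]) ** 2
    return sum_err / len(data_x)                   # average squared error per training item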

In retrospect, one of the main challenges when implementing a neural network from scratch is that there are a lot of helper functions to deal with.

Implementing a simple neural network from scratch is feasible, but not easy. By simple I mean a single hidden layer. But for two or more hidden layers, implementation gets extremely difficult.

*Doc Savage was a precursor to today’s superheroes. He didn’t have special powers but he was a super-brilliant scientist with amazing athletic skills. He faced many challenges but had five friends who helped him. In the 1930s and 40s he appeared in a monthly magazine, with stories by Lester Dent and art by Walter Baumhofer. The character was mostly forgotten in the 1950s but then had a second wave of popularity in the 1960s and 70s when the stories were re-published in paperback form. Same stories but attributed to a fictitious “Kenneth Robeson” and new art by James Bama featuring truly weird hair (which I didn’t like at all). I’ve read a few of the stories and found them mildly interesting from a historical/cultural perspective.*

Now, suppose someone applies for a loan from you but your model predicts p = 0.65 and you reject the loan. And you want to explain why the loan was rejected. One way to explain a prediction model is to use a counterfactual. An example might be, “If x1 had been increased to 71.0 and x2 had been decreased to 53.0 and x3 stayed the same, your loan would have been approved.” Notice this is sort of an indirect explanation rather than a direct explanation of how the model works.

One immediately obvious problem with this counterfactual technique is that there could be many counterfactuals. How do you choose the best one?

By far the best explanation of counterfactual interpretability I’ve seen is at: https://christophm.github.io/interpretable-ml-book/counterfactual.html. It describes a technique suggested by Wachter, Sandra, Brent Mittelstadt, and Chris Russell in “Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR” (2017) for choosing just one of the counterfactuals.

Basically you find the one counterfactual that has the least change in the xi predictor values. But the details are quite tricky. For example, suppose a loan applicant had: x1 = 65.0, x2 = 60.0, x3 = 35.0 and two counterfactuals are:

(1) if x1 increased to 71.0 and x2 decreased to 58.0 and x3 increased to 38.0

(2) if x1 increased to 70.0 and x2 stayed at 60.0 and x3 increased to 36.0

then counterfactual (2) is better because its changes in the xi are smaller. I’ve left out a ton of details but this example gives you a rough idea. To be honest, I think that there’s a lot of room for improvement in techniques for picking a best counterfactual explanation of a model’s results. For example, the technique described in the research paper above doesn’t really handle categorical data very well. I speculate that using a neural autoencoder preprocessor on the predictor values would give better results.
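For instance, if you scored candidates with a plain sum of absolute changes in the predictors (a big simplification; the Wachter et al. technique normalizes each variable's change, for example by its median absolute deviation), a sketch might look like:

import numpy as np

def total_change(original, counterfactual):
    # unnormalized sum of absolute changes in the predictor values
    return np.sum(np.abs(np.array(counterfactual) - np.array(original)))

orig = [65.0, 60.0, 35.0]
cf1 = [71.0, 58.0, 38.0]
cf2 = [70.0, 60.0, 36.0]
print(total_change(orig, cf1))   # 11.0
print(total_change(orig, cf2))   #  6.0 -- smaller total change, so (2) is preferred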

The topic of model interpretability is quite interesting and will become more important as machine learning prediction models increasingly affect people’s lives.

*Four humorous observations related to “interpretability” by the brilliant Gary Larson. Click to enlarge.*