Neural Network Training using Simplex Optimization

I wrote an article titled “Neural Network Training using Simplex Optimization” in the October 2014 issue of Visual Studio Magazine. A neural network is like a complicated math equation that has variables and coefficients. Training a neural network is the process of finding good values for the coefficients (which are called weights and biases).


To find good values for the weights and biases, you use training data that has known input and output values. You want to minimize the error between computed outputs and actual outputs. This is called a numerical optimization problem.

There are about a dozen or so common numerical optimization techniques that can be used to train a neural network. By far the most common technique is called back-propagation. Another technique that is becoming increasingly popular is called particle swarm optimization.

One of the oldest numerical optimization techniques is called simplex optimization. A simplex in two dimensions is a triangle, so simplex optimization uses three candidate solutions. There are many variations of simplex optimization. The most common is called the Nelder-Mead algorithm. My article uses a simpler version of simplex optimization that doesn’t have a particular name.

Simplex optimization is also known as amoeba method optimization, not because it mimics the behavior of an amoeba, but because if you graph the behavior of the algorithm, which is based on geometry, the triangle appears to ooze across the screen, vaguely resembling an amoeba.
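To give a feel for how a simple no-name simplex method might work, here is a minimal Python sketch (my own generic illustration, not the exact algorithm from the article): keep three candidate solutions, repeatedly replace the worst one with its reflection through the midpoint of the better two, and contract the worst point toward that midpoint when the reflection doesn’t help.

```python
import random

def simplex_minimize(f, dim, iters=1000, seed=0):
    # Maintain three candidate solutions -- the vertices of the "triangle".
    rnd = random.Random(seed)
    pts = [[rnd.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(3)]
    for _ in range(iters):
        pts.sort(key=f)                          # best first, worst last
        best, other, worst = pts
        # Centroid (midpoint) of the two better vertices.
        cen = [(b + o) / 2.0 for b, o in zip(best, other)]
        # Reflect the worst vertex through the centroid.
        refl = [2.0 * c - w for c, w in zip(cen, worst)]
        if f(refl) < f(worst):
            pts[2] = refl                        # accept the reflection
        else:
            # Otherwise contract the worst vertex toward the centroid.
            pts[2] = [(c + w) / 2.0 for c, w in zip(cen, worst)]
    return min(pts, key=f)

# Example: minimize the sphere function, whose minimum is at (0, 0).
sol = simplex_minimize(lambda v: sum(x * x for x in v), dim=2)
print(sol)
```

Real Nelder-Mead adds expansion and shrink steps, but this reflect-and-contract loop shows the basic geometric idea.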

Posted in Machine Learning | Leave a comment

The Logit Log-Odds Function in Machine Learning

This week I was working with the logit function, also known as the log-odds function. There are plenty of deep math explanations of the logit function, but I think most descriptions miss the main point.

The probability of an event, p, is a number between 0 and 1 that is a measure of how likely the event is. The bottom line is that a logit function result is (almost) a number between -4 and +4 that is a measure of how likely an event is. I say “almost”, because in theory a logit result can be from -infinity to +infinity, but in most situations the result is between about -4 and +4, and in the majority of those situations the result is between -2 and +2.

In other words, probability and logit values describe how likely an event is.

The definition of the logit function (where log is the natural logarithm) is

logit(p) = log(p / (1-p))

Notice that the only real information in the logit function is a probability, so logit cannot supply more information than probability. The p / (1-p) term is the odds of an event. For example, if the probability of some event is 0.75, then the odds of the event are 0.75 / (1 - 0.75) = 3 / 1 or “three to one odds”. So logit is just the log of a probability expressed as odds, hence the name log-odds, which was shortened to “logit”.
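In Python, the arithmetic above looks like this (just a quick sanity check of the definition):

```python
import math

def logit(p):
    # log-odds: the natural log of p / (1 - p)
    return math.log(p / (1.0 - p))

# A probability of 0.75 means odds of three to one.
odds = 0.75 / (1.0 - 0.75)
print(odds)            # 3.0
print(logit(0.75))     # log(3) is approximately 1.0986
print(logit(0.5))      # 0.0 -- a 50% likely event has logit 0
```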

Here’s what the logit function looks like (the tails go off to infinity):


So, why use the logit function at all? There are two reasons why the logit function might be used. First, because a logit value that is negative is less than 50% likely, and a logit value that is positive is more than 50% likely, logit values are easy to interpret by eye for some problems. The second reason is that, because of properties of the math log function, two logit values can sometimes be easier to compare than the two associated probabilities. I don’t really buy either reason to be honest — I prefer to use probabilities.

Final notes: the logit function is the math inverse of the logistic sigmoid function:

logistic(z) = 1.0 / (1.0 + e^-z)

The logistic sigmoid function has many uses in machine learning. And, the logistic sigmoid function is closely related to tanh, the hyperbolic tangent function, another common ML function, especially with neural networks. The relationship between logistic and tanh is:

tanh(z) = 2 * logistic(2z) - 1

logistic(z) = (tanh(z/2) + 1) / 2
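These relationships are easy to verify numerically. A quick Python check:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

p = 0.75
z = 1.5

# logistic is the inverse of logit (and vice versa)
print(logistic(logit(p)))                        # 0.75
# tanh(z) = 2 * logistic(2z) - 1
print(math.tanh(z), 2 * logistic(2 * z) - 1)     # same value twice
# logistic(z) = (tanh(z/2) + 1) / 2
print(logistic(z), (math.tanh(z / 2) + 1) / 2)   # same value twice
```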

In short, the logit, logistic sigmoid, and tanh functions are all related to each other and are conceptually based on probability.

Posted in Machine Learning | Leave a comment

Probit Classification using C#

In machine learning, a classification problem is one where you want to predict something, where the something takes on a class value (such as “died” or “survived”) as opposed to a strictly numeric value (such as blood pressure). The variables used to make the prediction are called the features, or the independent variables. For example, to predict whether or not a hospital patient will die or survive, you might use features age, sex, and kidney-test score.


There are several ML classification techniques, for example, logistic regression classification, neural network classification, decision tree classification, and naive Bayes classification. Different classification techniques tend to be suited to different types of problems.

I wrote an article titled “Probit Classification using C#” in the October 2015 issue of MSDN Magazine. Probit classification is very similar to logistic regression classification. Probit stands for “probability unit” because the result of probit classification is a number between 0 and 1 which can be interpreted as a probability.

Probit classification isn’t used as often as other classification techniques, except by analysts who work in finance and economics. I believe this is mostly for historical reasons. Probit classification tends to give results that are pretty much the same as logistic regression classification.
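Under the hood, probit classification maps a weighted sum of the features through the standard normal cumulative distribution function rather than the logistic sigmoid. Here’s a quick Python sketch (using the standard erf identity for the normal CDF) comparing the two squashing functions:

```python
import math

def normal_cdf(z):
    # Standard normal CDF, expressed via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Both functions squash any z into (0, 1) and cross 0.5 at z = 0,
# which is why probit and logistic classification give similar results.
for z in (-2.0, 0.0, 1.5):
    print(z, normal_cdf(z), logistic(z))
```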

Posted in Machine Learning

A Recap of Science Fiction Movies of 2013

The year 2013 is now long past, so I figured I’d review the science fiction films released that year. Here are 10 significant (meaning only that I saw them) sci-fi movies from 2013, with my ratings, from best to worst. I didn’t include super hero movies like Iron Man 3, Man of Steel, Thor: The Dark World, and The Wolverine because they belong in a separate category in my mind.

1. Gravity – Grade = A. Sandra Bullock and George Clooney adrift in space. This movie had me on the edge of my seat from the first scene to the last. No real plot to speak of, and not much character development, so I can understand why a lot of people don’t like this movie so much. But I loved it.


2. Oblivion – Grade = A-. Tom Cruise as one of the few humans on Earth after a war with aliens. I rarely have high expectations for a Tom Cruise vehicle, but I was very pleasantly surprised. Very clever plot. At the end, I found myself saying, in a good way, “Why didn’t I see that coming!?”


3. The Hunger Games: Catching Fire – Grade = B. Jennifer Lawrence competes against previous champions in a fight to the death. I didn’t like the first Hunger Games (2012) movie at all. I was forced to see this film and it was another pleasant surprise. This second Hunger Games was a lot less trite and clichéd than the first.


4. Star Trek into Darkness – Grade = B. Kirk versus Khan. Again. A third 2013 film that exceeded my expectations. I liked the first Chris Pine as Kirk Star Trek (2009) a lot, but sequels can be iffy propositions, so I didn’t know what to expect here. I prefer this sequel to the 2009 film. Good action combined with intelligent plot.


5. Ender’s Game – Grade = B-. Space cadet Ender Wiggin destroys an alien species then saves their last egg. Many of my friends are huge fans of the book and so were rather disappointed with this somewhat lackluster film. Not a bad film, just not a really good film.


6. Pacific Rim – Grade = C. Giant robots are created to battle giant alien monsters emerging from some other dimension or something, on the ocean floor. I liked this much better than I thought I would. A good friend of mine, Ken L., is often a perfect negative indicator for me. Usually, movies that he likes a lot, like Sucker Punch (2011), are not my favorites, and movies I like typically leave him unimpressed. So, when Ken raved about Pacific Rim, I was wary. But it was much better than I thought it’d be.


7. Elysium – Grade = C. Matt Damon champions the downtrodden on Earth against the elite citizens in space. I like most Matt Damon films. And the story sounded intriguing. But the movie just didn’t do much for me. The entire story seemed a bit illogical and too far-fetched to me, which I know is weird when I can suspend disbelief for other films.


8. Europa Report – Grade = C. Found video footage reveals how a mission to Jupiter’s moon Europa went wrong. Not a bad low-budget little movie. Seemed fairly realistic to me, but the story dragged a bit.


9. Riddick – Grade = C. Arg! Vin Diesel doing a lot of staring and a lot of fighting. I really like Chronicles of Riddick (2004) and had high hopes. But, alas, this part III just didn’t come together for me. Just a little too slow, a little too lame. One of those films where the whole is less than the sum of its parts.


10. After Earth – Grade = F. An incredibly annoying Jaden Smith and a moderately annoying Will Smith roaming randomly in a thoroughly annoying movie. Bad movie. Very bad movie. It’s hard for a movie to be both boring and nonsensical, but this film managed it. Epic bad.


Posted in Top Ten | 1 Comment

Creating Neural Networks using Azure Machine Learning Studio

Several weeks ago, Microsoft released a new tool and system to create machine learning models. I wrote an article titled “Creating Neural Networks using Azure Machine Learning Studio” in the September 2014 issue of Visual Studio Magazine.


The system is cloud-based (on Microsoft Azure). The backend, where computations are performed, is called Microsoft Azure Machine Learning (sometimes abbreviated MAML). The front-end UI part of the system is a Web application called Machine Learning Studio (ML Studio).

In the article, I describe, step by step, how to create a neural network model that predicts the species (either “setosa”, “versicolor”, or “virginica”) of an iris flower, based on four numeric features: sepal length, sepal width, petal length, and petal width. A sepal is a green leaf-like structure.

ML Studio is an almost completely drag and drop system. You drag modules that represent either data or actions on data (functions or methods, to a programmer) onto a design surface and then connect the modules.

Fig. 1: The iris experiment

The graphical approach is much, much faster than creating a prediction model using code. On the downside, the SDK for the system has not yet been released so you can’t write custom modules, meaning you can only do whatever the built-in modules can do. An analogy is Lego. With a lot of Lego modules you can build a lot of cool things. But if you had some machine to design and create custom Lego pieces (like an SDK), you could build anything.

It will be interesting to see if Azure ML gains traction among developers, business analysts, and data scientists. I think Azure ML is very cool, but the technology landscape is littered with the carcasses of great technologies that never caught on because of bad marketing or bad timing or just bad luck.

Posted in Machine Learning

Dev Connections Conference 2014 Recap

I spoke for Microsoft at the 2014 Dev Connections software and IT conference this week (September 15-19). I gave two talks, “Introduction to Speech Recognition with C#” and “Developing Neural Networks with Visual Studio”. I’d estimate there were about 1,500 attendees at the conference. One guy I talked to at a lunch was a typical attendee: he worked for a mid-sized regional bank as sort of a Jack of all trades, doing IT tasks, and also developing line of business applications (both desktop and Web). My strategic goal was to educate attendees about Microsoft’s expertise and thought leadership with machine learning. My tactical goal was to demonstrate to developers that using the Microsoft technology stack (Visual Studio, C#, Azure, etc.) is a great way (actually, the best way in my honest opinion) to extract usable information from data.


The conference had well over 200 one-hour talks. In each hour time slot, there were between 12 and 20 talks for attendees to choose from. Most rooms held about 120 people, and there were a handful of double-sized rooms.

The conference was at the Aria hotel in Las Vegas. Las Vegas is my favorite place for conferences. It’s relatively inexpensive, and easy to get to from almost anywhere. The hotels are huge (I read that 15 of the largest 25 hotels in the world are in Vegas) with enormous convention areas so there’s no need for attendees to be scattered across a dozen different hotels and have to bus or taxi to a dedicated convention center (such as in San Francisco). Also, the town is walking-friendly, close to the airport (about a $25 fare), and has lots of things to see.

If you haven’t been to a conference in Las Vegas before, you might not have an accurate idea of what goes on. These aren’t party-time morale events. You typically get up very early in the morning, and then attend in-depth technical sessions all day long. It’s actually quite exhausting, but in a pleasant way if you’re a geek like me.


Developer and IT conferences like Dev Connections are fairly expensive, typically from $1700 to $3600 for a 3 to 5 day event. Are they worth the price? In my opinion, yes. The true value doesn’t really come from the content in the conference sessions, since much of that content is available online. The value comes from being exposed to new ideas and techniques that you just don’t have time to discover during day to day work tasks. Without crunching any numbers, I’d estimate that a developer who attends one of these conferences will pay back his company, in terms of increased and improved productivity, far more than the cost of attending.

Posted in Machine Learning

Graphing Ackley’s Function using SciLab

Ackley’s function is a standard benchmark mathematical function used to test numerical optimization algorithms. The function is defined in two dimensions as:

f(x,y) = -20 * exp(-0.2 * sqrt(0.5 * (x^2 + y^2)))
         - exp(0.5 * (cos(2*pi*x) + cos(2*pi*y)))
         + 20 + e

The function has a minimum value of 0.0 located at x = 0, y = 0. I wanted to graph the function so I used SciLab, a free program that is similar to the very expensive MATLAB program.
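The definition is easy to sanity-check in Python before graphing it:

```python
import math

def ackley(x, y):
    # Ackley's function in two dimensions; global minimum of 0 at (0, 0).
    return (-20.0 * math.exp(-0.2 * math.sqrt(0.5 * (x * x + y * y)))
            - math.exp(0.5 * (math.cos(2 * math.pi * x)
                              + math.cos(2 * math.pi * y)))
            + 20.0 + math.e)

print(ackley(0.0, 0.0))   # approximately 0.0 -- the global minimum
print(ackley(1.0, 1.0))   # approximately 3.625 -- one of many local minima
```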

Here are the five SciLab commands I issued:

-->z=-20 * exp(-0.2*sqrt(0.5 * (x.^2 + y.^2)))
     - exp(0.5 * (cos(2*%pi*x) + cos(2*%pi*y)))
     + 20 + %e;

The first command sets up a matrix of (x,y) values from -4 to +4, 0.10 units apart. The second command is the definition of Ackley’s function for two variables. Note the use of the .^ operator rather than the ^ operator. The third through fifth commands create the graph.


Notice that Ackley’s function has many local minimum values, which makes finding the optimal solution at (0,0) a bit tricky for some numerical optimization algorithms.

Posted in Machine Learning