Neural Network Weight Decay and Restriction

I wrote an article in the July 2014 issue of MSDN Magazine titled “Neural Network Weight Decay and Restriction”. See


You can think of a neural network as a complicated math function that accepts one or more numeric input values, and generates one or more numeric output values. The output values are determined in part by a set of values called the network's weights and biases. Training a neural network is the process of finding good values for the weights and biases.
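The "math function" view can be sketched in a few lines of Python. This is a minimal fully connected network with one hidden layer; the function name, the tanh hidden activation, and the identity output are my own illustrative choices, not details from the article:

```python
import math

def compute_outputs(x_values, ih_weights, h_biases,
                    ho_weights, o_biases):
    # hidden layer: weighted sum of inputs plus bias, then tanh
    hidden = []
    for j in range(len(h_biases)):
        s = h_biases[j]
        for i in range(len(x_values)):
            s += x_values[i] * ih_weights[i][j]
        hidden.append(math.tanh(s))
    # output layer: weighted sum of hidden values plus bias
    outputs = []
    for k in range(len(o_biases)):
        s = o_biases[k]
        for j in range(len(hidden)):
            s += hidden[j] * ho_weights[j][k]
        outputs.append(s)
    return outputs
```

Changing any weight or bias value changes the outputs, which is why training amounts to searching for the weight and bias values that produce good outputs.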

Training is accomplished by using a set of training data that has known output values. Training finds the set of weights and bias values so that when presented with the training data input values, the computed output values are very close to the known output values in the training data.

A major problem when training a neural network is a phenomenon called over-fitting. Over-fitting means you’ve found weights and bias values that work perfectly or almost perfectly with the training data, but when you present the trained network with new data, the prediction accuracy is very poor.

There are several strategies you can use to try to deal with over-fitting. One is called weight decay; during training, each weight and bias value is decremented by a small amount. The idea is that if there is no counter effect from the training data, useless weights will fade away. Another strategy is weight restriction; during training, weight and bias values are not allowed to get too big or too small.
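Here's a minimal Python sketch of both ideas in a single update routine. The function name, the multiplicative form of the decay, and the decay and limit constants are my own illustrative choices, not values from the article:

```python
def update_with_decay(weights, grads, learn_rate,
                      decay=0.0001, limit=10.0):
    # each weight gets a normal gradient update, then weight decay
    # shrinks it slightly toward zero, then weight restriction
    # clamps it to the interval [-limit, +limit]
    for i in range(len(weights)):
        weights[i] += learn_rate * grads[i]  # usual update
        weights[i] *= (1.0 - decay)          # weight decay
        if weights[i] > limit:               # weight restriction
            weights[i] = limit
        elif weights[i] < -limit:
            weights[i] = -limit
    return weights
```

A weight that never receives a meaningful gradient just keeps shrinking, and no weight can run off to a huge magnitude.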

Posted in Machine Learning

Neural Network Back-Propagation and De-Modularizing

I ran across an unusual scenario recently where it was beneficial to “de-modularize” some code. I’m not sure if “de-modularize” is a word or not, but I mean refactoring some code that was in two functions to one larger function. The code I was working with was neural network training using the back-propagation algorithm. My original code resembled this:

Method Train:

loop until done
  for each data item
    compute output values
    update weight values
  end for
end loop

In other words, the training method was a wrapper around two methods, ComputeOutputs and UpdateWeights. Nice and modular. The update-weights method resembled:

Method UpdateWeights:

compute output gradients
compute hidden gradients
use gradients to compute deltas
update weights using deltas
add momentum using deltas from previous call
save deltas for the next call

I’m leaving out a lot of details, but the problem with the modular approach was storing the previous deltas. The previous deltas are stored in two matrices and two arrays. One approach is to make these four data structures class members. But that’s ugly because it doesn’t make replacing the training method easy. A second approach is to declare the data structures inside method Train. But then I’d have to pass them as four additional parameters to method UpdateWeights, or create a “state-context” data structure that holds all four and pass it as a single parameter.


In the end, the best solution was to de-modularize the code by ditching the UpdateWeights method and placing its code directly into method Train. Here’s the result:

public void Train2(double[][] trainData, int maxEpochs,
  double learnRate, double momentum)
{
  // integrated 'UpdateWeights' version
  // back-prop specific arrays
  double[] oGrads = new double[numOutput]; // output gradients
  double[] hGrads = new double[numHidden]; // hidden gradients

  // back-prop momentum specific arrays
  double[][] ihPrevWeightsDelta = MakeMatrix(numInput, numHidden);
  double[] hPrevBiasesDelta = new double[numHidden];
  double[][] hoPrevWeightsDelta = MakeMatrix(numHidden, numOutput);
  double[] oPrevBiasesDelta = new double[numOutput];

  // train
  int epoch = 0;
  double[] xValues = new double[numInput]; // inputs
  double[] tValues = new double[numOutput]; // targets

  int[] sequence = new int[trainData.Length];
  for (int i = 0; i < sequence.Length; ++i)
    sequence[i] = i;

  while (epoch < maxEpochs)
  {
    double mse = MeanSquaredError(trainData);
    if (mse < 0.040) break; // good enough; exit early

    Shuffle(sequence); // visit training items in random order
    for (int ii = 0; ii < trainData.Length; ++ii)
    {
      int idx = sequence[ii];
      Array.Copy(trainData[idx], xValues, numInput);
      Array.Copy(trainData[idx], numInput, tValues, 0, numOutput);
      ComputeOutputs(xValues); // outputs stored internally
      //UpdateWeights(tValues, learnRate, momentum);

      // ---- Update-Weights section
      // 1. compute output gradients
      for (int i = 0; i < numOutput; ++i)
      {
        // derivative for softmax = (1 - y) * y
        double derivative = (1 - outputs[i]) * outputs[i];
        oGrads[i] = derivative * (tValues[i] - outputs[i]);
      }

      // 2. compute hidden gradients
      for (int i = 0; i < numHidden; ++i)
      {
        // derivative of tanh = (1 - y) * (1 + y)
        double derivative = (1 - hOutputs[i]) *
          (1 + hOutputs[i]);
        double sum = 0.0;
        for (int j = 0; j < numOutput; ++j)
          sum += oGrads[j] * hoWeights[i][j];
        hGrads[i] = derivative * sum;
      }

      // 3a. update input-hidden weights
      // (weights can be updated in any order)
      for (int i = 0; i < numInput; ++i)
      {
        for (int j = 0; j < numHidden; ++j)
        {
          double delta = learnRate * hGrads[j] * inputs[i];
          ihWeights[i][j] += delta;
          ihWeights[i][j] += momentum *
            ihPrevWeightsDelta[i][j]; // momentum
          ihPrevWeightsDelta[i][j] = delta; // save
        }
      }

      // 3b. update hidden biases
      for (int i = 0; i < numHidden; ++i)
      {
        double delta = learnRate * hGrads[i];
        hBiases[i] += delta;
        hBiases[i] += momentum *
          hPrevBiasesDelta[i]; // momentum
        hPrevBiasesDelta[i] = delta; // save
      }

      // 4a. update hidden-output weights
      for (int i = 0; i < numHidden; ++i)
      {
        for (int j = 0; j < numOutput; ++j)
        {
          double delta = learnRate * oGrads[j] * hOutputs[i];
          hoWeights[i][j] += delta;
          hoWeights[i][j] += momentum *
            hoPrevWeightsDelta[i][j]; // momentum
          hoPrevWeightsDelta[i][j] = delta; // save
        }
      }

      // 4b. update output biases
      for (int i = 0; i < numOutput; ++i)
      {
        double delta = learnRate * oGrads[i] * 1.0;
        oBiases[i] += delta;
        oBiases[i] += momentum *
          oPrevBiasesDelta[i]; // momentum
        oPrevBiasesDelta[i] = delta; // save
      }
      // ---- end Update-Weights
    } // each training item
    ++epoch;
  } // while
} // Train2

De-modularizing makes the training method long — well over one page of code, which is usually bad. But in this rare case, the de-modularized version is superior.

Posted in Machine Learning

The Normal Cumulative Density Function using C#

I was looking at a machine learning technique called probit (“probability unit”) classification. Probit classification is exactly like logistic regression classification except that where LR uses the logistic sigmoid function to compute output, probit uses the cumulative density function of the Gaussian (Normal) distribution. The CDF is the area under the curve of the Gaussian bell-shaped curve, from -infinity to some value z.


So, to write probit classification code, I needed to implement a CDF method using C#. Unlike the logistic sigmoid function, which is very simple — 1.0 / (1.0 + Math.Exp(-z)) — there is no simple way to compute the CDF. The most common approach is to use one of many close approximations. I used equation 7.1.26 from the famous reference the “Handbook of Mathematical Functions” (1965), by Abramowitz and Stegun. The book is often just called “A&S”.

Actually, equation 7.1.26 approximates something called the “error function” (often called “erf”), but the CDF is just a slight modification of erf. Anyway, here’s one way to code the CDF using C#:

static double CumDensity(double z)
{
  double p = 0.3275911;
  double a1 = 0.254829592;
  double a2 = -0.284496736;
  double a3 = 1.421413741;
  double a4 = -1.453152027;
  double a5 = 1.061405429;

  int sign;
  if (z < 0.0)
    sign = -1;
  else
    sign = 1;

  double x = Math.Abs(z) / Math.Sqrt(2.0);
  double t = 1.0 / (1.0 + p * x);
  double erf = 1.0 - (((((a5 * t + a4) * t) + a3)
    * t + a2) * t + a1) * t * Math.Exp(-x * x);
  return 0.5 * (1.0 + sign * erf);
}
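The A&S 7.1.26 approximation ports directly to Python too, which is handy for a quick sanity check (the Python function name is my own):

```python
import math

def cum_density(z):
    # A&S equation 7.1.26 approximation to erf, adjusted to
    # give the standard normal CDF
    p = 0.3275911
    a1, a2, a3, a4, a5 = (0.254829592, -0.284496736,
                          1.421413741, -1.453152027, 1.061405429)
    sign = -1 if z < 0.0 else 1
    x = abs(z) / math.sqrt(2.0)
    t = 1.0 / (1.0 + p * x)
    erf = 1.0 - (((((a5 * t + a4) * t + a3) * t + a2) * t
                  + a1) * t * math.exp(-x * x))
    return 0.5 * (1.0 + sign * erf)
```

Two quick checks: cum_density(0.0) should be 0.5, and cum_density(-z) + cum_density(z) should be 1.0 for any z.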

I wrote a little harness to spit out values of CDF for z values from -4.0 to +4.0 (every 0.25), and then copy-pasted the results into Excel and made a graph. I also put values for the logistic sigmoid on the graph to see how they compare.

As the graph shows, both functions range from 0.0 to 1.0. So, it isn’t a surprise that logistic regression classification and probit classification give pretty much the same results. As it turns out, logistic regression is easier to work with, in part because it is easily differentiable, which makes model training using calculus-based techniques possible. Different fields tend to prefer one classification technique or the other. For example, in economics, probit seems to be more common, but in other fields, logistic regression seems more common.

Posted in Machine Learning

A Digi-Comp ii Emulator and Simulation using Python

Digi-Comp ii was an amazing mechanical toy computer made in the 1960s. It worked by rolling marbles down a plastic surface filled with toggle switches. A company called Evil Mad Scientist recently released a beautiful wooden re-creation (below).


I decided to write a computer program to emulate (or simulate depending on your point of view) the Digi-Comp ii. My Python programming was a bit rusty because I’ve mostly been using C#, so I thought I’d use Python just for fun.


My demo program multiplies 3 and 13. The calling code, with some minor edits to save space, is:

print "\nBegin Digi-Comp ii simulation \n"
dc = DigiComp()
dc.Initialize(35) # 35 balls in top reservoir
print "\nInitial state: "

# set up 3 * 13 problem
dc.q1 = 1; dc.q2 = 1; dc.q3 = 0
dc.m1 = 1; dc.m2 = 0; dc.m3 = 1; dc.m4 = 1
dc.multiply = 1
print "\nSettings to compute 3 * 13:"
# result in A should be 1110010

print "\nResult state (answer in A register):"

print "\nEnd Digi-Comp ii simulation \n"

The simulator is mostly a (not-shown) program-defined class named DigiComp. The program has about 250 lines of Python code. For example, here are two of the class methods:

def DoStart(self):
  # 'self.top' (number of balls in the top reservoir) is an
  # assumed attribute name; the original listing dropped it
  if self.top == 0:
    print "No balls in top reservoir"
  else:
    self.top = self.top - 1 # launch ball

def DoMultiply(self):
  if self.multiply == 0: self.DoClear() # go right
  elif self.multiply == 1: self.DoD2() # go left

The calling code sets Q (called MQ in the Digi-Comp manual) to 110 = 3d, with bits listed low-order first. Then it sets M to 1011 = 13d. The DoStart method launches the first virtual ball. The result is in the A register and is 1110010 = 1 + 2 + 4 + 32 = 39d. All in all, it was a fun little exercise.
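The low-order-first bit convention is easy to verify with a few lines of Python (the helper name here is mine, not part of the simulator):

```python
def bits_to_decimal(bits):
    # bits listed low-order first, matching the register
    # notation used above: [1, 1, 0] means 1*1 + 1*2 + 0*4
    total = 0
    for i, b in enumerate(bits):
        total += b * (2 ** i)
    return total
```

With this helper, the Q setting [1, 1, 0] comes out to 3, the M setting [1, 0, 1, 1] comes out to 13, and the A register result [1, 1, 1, 0, 0, 1, 0] comes out to 39, as expected.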

Posted in Machine Learning, Miscellaneous

My Top Ten Favorite Twilight Zone Episodes

I’m writing this blog post on the 4th of July. There’s a Twilight Zone marathon on TV (as usual) so I thought I’d list my favorite episodes so you can see if my list has any of your favorites.

1. It’s a Good Life – This is the episode where a young boy, Anthony, has the power to “send people to the cornfield”. He holds his neighbors as virtual hostages. Jack-in-the-boxes are creepy.


2. Nightmare at 20,000 Feet – William Shatner is on an airplane during a storm and thinks he sees some sort of creature on the wing of the airplane. Is he hallucinating or not?


3. Living Doll – Telly Savalas as an evil stepfather does not like his daughter’s Talky Tina doll. The feeling is mutual. “My name is Talky Tina and I’m going to kill you.”


4. Will the Real Martian Please Stand Up? – A group of bus passengers are stranded in a café during a snowstorm. One of them is a Martian scout in advance of an invasion. Why is the café owner not worried?


5. The Howling Man – An American traveler seeks shelter in a remote European castle during a storm. He discovers that Brother Jerome is holding a man captive in a dungeon. Jerome claims the prisoner is Satan. The prisoner claims Jerome is insane. One of them is right.


6. Death Ship – Three astronauts land on an alien planet and find a crashed spaceship identical to theirs – complete with exact replicas of dead versions of themselves! Are they really dead? Is it alien mind control? An alternate dimension?


7. To Serve Man – A group of seven-foot tall aliens come to earth bearing gifts. Two skeptical code breakers try to discover if the aliens are benign or just hungry.


8. The Jeopardy Room – A Soviet agent (played by Martin Landau) is attempting to defect to the West but is trapped inside a dingy hotel room by an assassin. The assassin tells him the room is rigged with a bomb. Is the trigger hooked to the door knob? The window latch?


9. The Silence – At an old school men’s club, an aristocratic member is annoyed by a new-money member who is constantly talking. They bet $500,000 on whether the talkative member can stay silent for one full year. A double-twist ending.


10. Mirror Image – A woman waiting at a bus stop experiences weird events. She thinks she sees her exact twin. Is the twin real? Is the twin an evil replica from another dimension? Or is she just having a nervous breakdown?


Posted in Top Ten

Distorting the MNIST Image Data Set

I wrote an article titled “Distorting the MNIST Image Data Set” in the July 2014 issue of MSDN Magazine. See


The MNIST (mixed National Institute of Standards and Technology) image data set is a collection of small (28 by 28 pixels) black-and-white images of handwritten 0-9 digits. There are 60,000 images in a training set and another 10,000 images in a test set.

The data sets were created so that image recognition researchers and practitioners can test new algorithms against the set, and compare their results with other results.

Even 60,000 training images aren’t very many, so a clever idea is to create additional training images from the original images, by slightly distorting each image. For example, the screenshot above shows the third training image, which happens to be a ‘4’, and a distorted version.

Distorting images is surprisingly tricky. There are many possibilities, but the technique I demonstrate uses a Gaussian kernel, which is an interesting topic in its own right.
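A 2D Gaussian kernel itself is simple to construct; here's a Python sketch (the size and sigma values are my own illustrative choices, not values from the article):

```python
import math

def gaussian_kernel(size=5, sigma=1.0):
    # each cell holds exp(-(dr^2 + dc^2) / (2*sigma^2)), measured
    # from the kernel center, then the whole kernel is normalized
    # so its entries sum to 1
    center = size // 2
    kernel = [[0.0] * size for _ in range(size)]
    total = 0.0
    for r in range(size):
        for c in range(size):
            d2 = (r - center) ** 2 + (c - center) ** 2
            kernel[r][c] = math.exp(-d2 / (2.0 * sigma * sigma))
            total += kernel[r][c]
    for r in range(size):
        for c in range(size):
            kernel[r][c] /= total
    return kernel
```

The largest value sits at the center of the kernel and values fall off smoothly toward the edges, which is what makes it useful for producing smooth, locally correlated displacements.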

Posted in Machine Learning

Selected Software Development Conferences in 2014

There are a lot of conferences for software developers. As I’m writing this blog post, year 2014 is about half-way over, so here are a few significant (meaning only that I know about them) software development conferences scheduled for the second half of 2014. If you know of any I missed, tell me at jamesDmccaffrey -at-

OSCON (Open Source Convention) – As the title suggests, this is for developers who use Python and other open source technologies. Attendance varies but is typically about 2,500. Run by O’Reilly Media who does the strange animal cover books. Recommended if you use open source software. Price: about $2500. (Yikes.)


July 20-24, 2014, Portland, Oregon

Visual Studio Live – This conference has been around for many years. Typically has 500 – 1500 attendees and a significant Microsoft presence. Run by 1105 Media who publishes Visual Studio Magazine. Highly recommended. Price: about $2100.


August 18-22, 2014, Redmond, Washington
October 6-9, 2014, Washington, DC
November 17-21, 2014, Orlando, Florida

IT/DevConnections – Another long-running conference. Typically 500 – 2000 attendees and a moderate Microsoft presence. Run by Penton Media who publishes Windows IT Pro Magazine. Strongly recommended. Price: about $1900.


September 15-19, 2014, Las Vegas, Nevada

JavaOne Conference – Run by Oracle, a company I’m not very fond of, but this conference gets good reviews. Many thousands of attendees. Recommended, if you use Java, obviously. Price: about $2000.


September 28 – October 2, 2014, San Francisco, CA

Better Software Conference – Run by a company called Software Quality Engineering (SQE). Attendance typically about 500 – 1000 attendees. In my opinion, the speaker quality at SQE events has been very poor in recent years. Not recommended at this time. Price: about $1900.


November 9-14, 2014, Orlando, Florida

DevIntersection Conference – A relatively new (2014 will be their third year) independent conference run by some of the former managers of the DevConnections Conference. I expect roughly 2500 attendees. Has a strong Microsoft presence. Very highly recommended. Price: about $1800 (Excellent value).


November 10-13, 2014, Las Vegas, Nevada

App Dev Trends Conference – A brand new conference. Typically, a first year conference draws about 300-600 people, but you can often get great pricing. Run by 1105 Media so the event will be professionally handled. I intend to go, so in other words, recommended. Price: about $1900.


December 8-11, 2014, Las Vegas, Nevada

Posted in Machine Learning, Miscellaneous, Software Test Automation