Neural Network Back-Propagation and De-Modularizing

I ran across an unusual scenario recently where it was beneficial to “de-modularize” some code. I’m not sure if “de-modularize” is a word or not, but I mean refactoring code that was split across two functions into one larger function. The code I was working with was neural network training using the back-propagation algorithm. My original code resembled this:

Method Train:

loop until done
  for each data item
    compute output values
    update weight values
  end for
end loop

In other words, the training method was a wrapper around two methods, ComputeOutputs and UpdateWeights. Nice and modular. The update-weights method resembled:

Method UpdateWeights:

compute output gradients
compute hidden gradients
use gradients to compute deltas
update weights using deltas
modify weights using deltas from previous call

I’m leaving out a lot of details, but the problem with the modular approach was storing the previous deltas, which live in two matrices and two arrays. One approach is to make these four data structures class members, but that’s ugly because it makes swapping out the training method harder. A second approach is to declare the data structures inside method Train, but then I’d have to pass them as four additional parameters to method UpdateWeights, or create a “state-context” data structure that holds all four and pass that as a single parameter.
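
For example, a state-context holder for the previous deltas (just a hypothetical sketch of the rejected alternative, not code I actually used; the field names mirror the arrays used later) could be as simple as:

public class PrevDeltasContext // hypothetical state-context holder
{
  public double[][] ihPrevWeightsDelta; // input-hidden weight deltas
  public double[] hPrevBiasesDelta;     // hidden bias deltas
  public double[][] hoPrevWeightsDelta; // hidden-output weight deltas
  public double[] oPrevBiasesDelta;     // output bias deltas
}

Then UpdateWeights would accept a PrevDeltasContext as an extra parameter, which works but adds clutter for very little benefit.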


In the end, the best solution was to de-modularize the code by ditching the UpdateWeights method and placing its code directly into method Train. Here’s the result:

public void Train2(double[][] trainData, int maxEpochs,
  double learnRate, double momentum)
{
  // integrated 'UpdateWeights' version 
  // back-prop specific arrays
  double[] oGrads = new double[numOutput]; // gradients
  double[] hGrads = new double[numHidden];

  // back-prop momentum specific arrays 
  double[][] ihPrevWeightsDelta = MakeMatrix(numInput,
    numHidden);
  double[] hPrevBiasesDelta = new double[numHidden];
  double[][] hoPrevWeightsDelta = MakeMatrix(numHidden,
    numOutput);
  double[] oPrevBiasesDelta = new double[numOutput];

  // train 
  int epoch = 0;
  double[] xValues = new double[numInput]; // inputs
  double[] tValues = new double[numOutput]; // targets

  int[] sequence = new int[trainData.Length];
  for (int i = 0; i < sequence.Length; ++i)
    sequence[i] = i;

  while (epoch < maxEpochs)
  {
    double mse = MeanSquaredError(trainData);
    if (mse < 0.040) break;

    Shuffle(sequence); // random order
    for (int ii = 0; ii < trainData.Length; ++ii)
    {
      int idx = sequence[ii];
      Array.Copy(trainData[idx], xValues, numInput);
      Array.Copy(trainData[idx], numInput, tValues, 0,
        numOutput);
      ComputeOutputs(xValues);
      //UpdateWeights(tValues, learnRate, momentum);

      // ---- Update-Weights section
      // 1. compute output gradients
      for (int i = 0; i < numOutput; ++i)
      {
        // derivative for softmax = (1 - y) * y 
        double derivative = (1 - outputs[i]) * outputs[i];
        oGrads[i] = derivative * (tValues[i] - outputs[i]);
      }

      // 2. compute hidden gradients
      for (int i = 0; i < numHidden; ++i)
      {
        // derivative of tanh = (1 - y) * (1 + y)
        double derivative = (1 - hOutputs[i]) *
          (1 + hOutputs[i]);
        double sum = 0.0;
        for (int j = 0; j < numOutput; ++j)
        {
          double x = oGrads[j] * hoWeights[i][j];
          sum += x;
        }
        hGrads[i] = derivative * sum;
      }

      // 3a. update input-to-hidden weights
      // (weights can be updated in any order)
      for (int i = 0; i < numInput; ++i) // 0..2 (3)
      {
        for (int j = 0; j < numHidden; ++j) // 0..3 (4)
        {
          double delta = learnRate * hGrads[j] * inputs[i];
          ihWeights[i][j] += delta; 
          // now add momentum using previous delta.
          ihWeights[i][j] += momentum *
            ihPrevWeightsDelta[i][j];
          ihPrevWeightsDelta[i][j] = delta; 
        }
      }

      // 3b. update hidden biases
      for (int i = 0; i < numHidden; ++i)
      {
        double delta = learnRate * hGrads[i]; 
        hBiases[i] += delta;
        hBiases[i] += momentum *
          hPrevBiasesDelta[i]; // momentum
        hPrevBiasesDelta[i] = delta;
      }

      // 4a. update hidden-output weights
      for (int i = 0; i < numHidden; ++i)
      {
        for (int j = 0; j < numOutput; ++j)
        {
          double delta = learnRate * oGrads[j] *
            hOutputs[i];
          hoWeights[i][j] += delta;
          hoWeights[i][j] += momentum *
            hoPrevWeightsDelta[i][j]; // momentum
          hoPrevWeightsDelta[i][j] = delta; // save
        }
      }

      // 4b. update output biases
      for (int i = 0; i < numOutput; ++i)
      {
        double delta = learnRate * oGrads[i] * 1.0; // 1.0 is the bias input
        oBiases[i] += delta;
        oBiases[i] += momentum *
          oPrevBiasesDelta[i]; // momentum
        oPrevBiasesDelta[i] = delta; // save
      }
      // ---- end Update-Weights
    } // each training item
    ++epoch;
  } // while
} // Train2

De-modularizing makes the training method long — well over one page of code, which is usually bad. But in this rare case, the de-modularized version is superior.
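
Calling the integrated version is the same as calling the original wrapper. For example (the NeuralNetwork instance, the data-loading helper, and the parameter values here are purely illustrative):

NeuralNetwork nn = new NeuralNetwork(numInput, numHidden, numOutput);
double[][] trainData = LoadTrainData(); // hypothetical helper
int maxEpochs = 1000;
double learnRate = 0.05;
double momentum = 0.01;
nn.Train2(trainData, maxEpochs, learnRate, momentum);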

Posted in Machine Learning

The Normal Cumulative Density Function using C#

I was looking at a machine learning technique called probit (“probability unit”) classification. Probit classification is exactly like logistic regression classification except that where LR uses the logistic sigmoid function to compute output, probit uses the cumulative density function of the Gaussian (Normal) distribution. The CDF is the area under the Gaussian bell-shaped curve from -infinity to some value z.

[Graph: the normal cumulative density function and the logistic sigmoid compared]

So, to write probit classification code, I needed to implement a CDF method using C#. Unlike the logistic sigmoid function, which is very simple — 1.0 / (1.0 + Math.Exp(-z)) — there is no simple, exact way to compute the CDF. The most common approach is to use one of many close approximations. I used equation 7.1.26 from the well-known reference “Handbook of Mathematical Functions” (1965) by Abramowitz and Stegun. The book is often just called “A&S”.

Actually, equation 7.1.26 approximates something called the “error function” (often called “erf”), but the CDF is just a slight modification of erf: CDF(z) = 0.5 * (1 + erf(z / sqrt(2))). Anyway, here’s one way to code CDF using C#:

static double CumDensity(double z)
{
  // coefficients from A&S equation 7.1.26
  double p = 0.3275911;
  double a1 = 0.254829592;
  double a2 = -0.284496736;
  double a3 = 1.421413741;
  double a4 = -1.453152027;
  double a5 = 1.061405429;

  int sign;
  if (z < 0.0)
    sign = -1;
  else
    sign = 1;

  double x = Math.Abs(z) / Math.Sqrt(2.0);
  double t = 1.0 / (1.0 + p * x);
  double erf = 1.0 - (((((a5 * t + a4) * t) + a3)
    * t + a2) * t + a1) * t * Math.Exp(-x * x);
  return 0.5 * (1.0 + sign * erf);
}

I wrote a little harness to spit out values of CDF for z values from -4.0 to +4.0 (every 0.25), and then copy-pasted the results into Excel and made a graph. I also put values for the logistic sigmoid on the graph to see how they compare.
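
The harness was nothing fancy. A minimal sketch (assuming CumDensity is defined as above; the logistic sigmoid is computed inline) looks like:

static void Main(string[] args)
{
  for (double z = -4.0; z <= 4.0; z += 0.25)
  {
    double cdf = CumDensity(z);
    double sig = 1.0 / (1.0 + Math.Exp(-z)); // logistic sigmoid
    Console.WriteLine(z.ToString("F2") + "\t" +
      cdf.ToString("F4") + "\t" + sig.ToString("F4"));
  }
}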

As the graph shows, both functions range from 0.0 to 1.0. So it isn’t a surprise that logistic regression classification and probit classification give pretty much the same results. As it turns out, logistic regression is easier to work with, in part because it is easily differentiable, which makes model training using calculus-based techniques possible. Different fields tend to favor one technique or the other. For example, in economics probit seems to be more common, while in most other fields logistic regression seems more common.

Posted in Machine Learning

A Digi-Comp ii Emulator and Simulation using Python

Digi-Comp ii was an amazing mechanical toy computer made in the 1960s. It worked by rolling marbles down a plastic surface filled with toggle switches. A company called Evil Mad Scientist recently released a beautiful wooden re-creation (below).

[Photo: the Evil Mad Scientist wooden re-creation of the Digi-Comp ii]

I decided to write a computer program to emulate (or simulate depending on your point of view) the Digi-Comp ii. My Python programming was a bit rusty because I’ve mostly been using C#, so I thought I’d use Python just for fun.


My demo program multiplies 3 and 13. The calling code, with some minor edits to save space, is:

print "\nBegin Digi-Comp ii simulation \n"
dc = DigiComp()
dc.Initialize(35) # 35 balls in top reservoir
print "\nInitial state: "
dc.Display()

# set up 3 * 13 problem
dc.q1 = 1; dc.q2 = 1; dc.q3 = 0
dc.m1 = 1; dc.m2 = 0; dc.m3 = 1; dc.m4 = 1
dc.multiply = 1
print "\nSettings to compute 3 * 13:"
dc.Display()
# result in A should be 1110010

dc.DoStart()
print "\nResult state (answer in A register):"
dc.Display()

print "\nEnd Digi-Comp ii simulation \n"

The simulator is mostly a (not-shown) program-defined class named DigiComp. The program has about 250 lines of Python code. For example, here are two of the class methods:

def DoStart(self):
  if self.top == 0:
    print "No balls in top reservoir"
    self.Display()
  else:
    self.top = self.top - 1 # launch ball
    self.DoMultiply()

def DoMultiply(self):
  if self.multiply == 0: self.DoClear() # go right
  elif self.multiply == 1: self.DoD2() # go left

The calling code sets Q (called MQ in the Digi-Comp manual) to 110 = 3 decimal, then sets M to 1011 = 13 decimal (bits are listed low-order bit first). The DoStart method launches the first virtual ball. The result ends up in the A register as 1110010 = 1 + 2 + 4 + 32 = 39 decimal. All in all, it was a fun little exercise.

Posted in Machine Learning, Miscellaneous

My Top Ten Favorite Twilight Zone Episodes

I’m writing this blog post on the 4th of July. There’s a Twilight Zone marathon on TV (as usual) so I thought I’d list my favorite episodes so you can see if my list has any of your favorites.

1. It’s a Good Life – This is the episode where a young boy, Anthony, has the power to “send people to the cornfield”. He holds his neighbors as virtual hostages. Jack-in-the-boxes are creepy.


2. Nightmare at 20,000 Feet – William Shatner is on an airplane during a storm and thinks he sees some sort of creature on the wing of the airplane. Is he hallucinating or not?


3. Living Doll – Telly Savalas as an evil stepfather does not like his daughter’s Talky Tina doll. The feeling is mutual. “My name is Talky Tina and I’m going to kill you.”


4. Will the Real Martian Please Stand Up? – A group of bus passengers are stranded in a café during a snowstorm. One of them is a Martian scout in advance of an invasion. Why is the café owner not worried?


5. The Howling Man – An American traveler seeks shelter in a remote European castle during a storm. He discovers that Brother Jerome is holding a man captive in a dungeon. Jerome claims the prisoner is Satan. The prisoner claims Jerome is insane. One of them is right.


6. Death Ship – Three astronauts land on an alien planet and find a crashed spaceship identical to theirs – complete with exact replicas of dead versions of themselves! Are they really dead? Is it alien mind control? An alternate dimension?


7. To Serve Man – A group of seven-foot tall aliens come to earth bearing gifts. Two skeptical code breakers try to discover if the aliens are benign or just hungry.


8. The Jeopardy Room – A Soviet agent (played by Martin Landau) is attempting to defect to the West but is trapped inside a dingy hotel room by an assassin. The assassin tells him the room is rigged with a bomb. Is the trigger hooked to the door knob? The window latch?


9. The Silence – At an old school men’s club, an aristocratic member is annoyed by a new-money member who is constantly talking. The aristocrat bets $500,000 that the talkative member cannot stay silent for one full year. A double-twist ending.


10. Mirror Image – A woman waiting at a bus stop experiences weird events. She thinks she sees her exact twin. Is the twin real? Is the twin an evil replica from another dimension? Or is she just having a nervous breakdown?


Posted in Top Ten

Distorting the MNIST Image Data Set

I wrote an article titled “Distorting the MNIST Image Data Set” in the July 2014 issue of MSDN Magazine. See http://msdn.microsoft.com/en-us/magazine/dn754573.aspx.

[Screenshot: an original MNIST training image and a distorted version]

The MNIST (Mixed National Institute of Standards and Technology) image data set is a collection of small (28 by 28 pixels) black-and-white images of handwritten 0-9 digits. There are 60,000 images in a training set and another 10,000 images in a test set.

The data sets were created so that image recognition researchers and practitioners can test new algorithms against the set, and compare their results with other results.

Even 60,000 training images aren’t very many, so a clever idea is to create additional training images from the original images by slightly distorting each image. For example, the screenshot above shows the third training image, which happens to be a ‘4’, and a distorted version.

Distorting images is surprisingly tricky. There are many possibilities, but the technique I demonstrate uses a Gaussian kernel, which is an interesting topic in its own right.
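
The article has the details, but just to give the flavor of the kernel part (this is an illustrative sketch, not the article’s exact code), a normalized 2D Gaussian kernel can be generated like this:

static double[][] MakeGaussianKernel(int size, double sigma)
{
  // size should be odd (e.g. 5); sigma controls the spread
  double[][] kernel = new double[size][];
  int half = size / 2;
  double sum = 0.0;
  for (int i = 0; i < size; ++i)
  {
    kernel[i] = new double[size];
    for (int j = 0; j < size; ++j)
    {
      double dx = i - half;
      double dy = j - half;
      kernel[i][j] = Math.Exp(-(dx * dx + dy * dy) /
        (2.0 * sigma * sigma));
      sum += kernel[i][j];
    }
  }
  for (int i = 0; i < size; ++i) // normalize so cells sum to 1.0
    for (int j = 0; j < size; ++j)
      kernel[i][j] /= sum;
  return kernel;
}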

Posted in Machine Learning

Selected Software Development Conferences in 2014

There are a lot of conferences for software developers. As I’m writing this blog post, year 2014 is about half-way over, so here are a few significant (meaning only that I know about them) software development conferences scheduled for the second half of 2014. If you know any I missed, tell me at jamesDmccaffrey -at- hotmail.com.


OSCON (Open Source Convention) – As the title suggests, this is for developers who use Python and other open source technologies. Attendance varies but is typically about 2,500. Run by O’Reilly Media, the company that does the books with the strange animal covers. Recommended if you use open source software. Price: about $2500. (Yikes.)

http://www.oscon.com/oscon2014


July 20-24, 2014, Portland, Oregon


Visual Studio Live – This conference has been around for many years. Typically has 500 – 1500 attendees and a significant Microsoft presence. Run by 1105 Media who publishes Visual Studio Magazine. Highly recommended. Price: about $2100.

http://vslive.com


August 18-22, 2014, Redmond, Washington
October 6-9, 2014, Washington, DC
November 17-21, 2014, Orlando, Florida


IT/DevConnections – Another long-running conference. Typically 500 – 2000 attendees and a moderate Microsoft presence. Run by Penton Media who publishes Windows IT Pro Magazine. Strongly recommended. Price: about $1900.

http://www.devconnections.com


September 15-19, 2014, Las Vegas, Nevada


JavaOne Conference – Run by Oracle, a company I’m not very fond of, but this conference gets good reviews. Many thousands of attendees. Recommended, if you use Java, obviously. Price: about $2000.

https://www.oracle.com/javaone/index.html


September 28 – October 2, 2014, San Francisco, CA


Better Software Conference – Run by a company called Software Quality Engineering (SQE). Typically about 500 – 1000 attendees. In my opinion, the speaker quality at SQE events has been very poor in recent years. Not recommended at this time. Price: about $1900.

http://bsceast.techwell.com/


November 9-14, 2014, Orlando, Florida


DevIntersection Conference – A relatively new (2014 will be their third year) independent conference run by some of the former managers of the DevConnections Conference. I expect roughly 2500 attendees. Has a strong Microsoft presence. Very highly recommended. Price: about $1800 (Excellent value).

http://www.devintersection.com/


November 10-13, 2014, Las Vegas, Nevada


App Dev Trends Conference – A brand new conference. Typically, a first year conference draws about 300-600 people, but you can often get great pricing. Run by 1105 Media so the event will be professionally handled. I intend to go, so in other words, recommended. Price: about $1900.

http://appdevtrends.com


December 8-11, 2014, Las Vegas, Nevada


Posted in Machine Learning, Miscellaneous, Software Test Automation

Equal Width Binning using C#

A common machine learning task is converting numeric data into categorical data. For example, suppose you have a set of employee annual incomes with values such as $48,000.00 and $39,500.00, and you want to transform them into values such as “low”, “medium”, and “high”. This is called discretization or, informally, binning the data. Binning is done for two main reasons. First, some ML algorithms expect categorical data. Second, binning tends to smooth out differences between data points, which is sometimes, but not always, useful.


There are three main ways to bin data. The simplest is to create intervals, or “buckets”, of equal width. The second approach, equal frequency, is to create buckets so that each bucket has an (approximately) equal number of data values in it. The third approach is to create buckets using a clustering algorithm such as k-means so that data values are grouped by similarity to each other. Each of the three techniques has significant pros and cons so there is not one clear best way to bin data.

There are many, many ways to implement equal width binning. Here’s one example. Suppose the data to bin is below and you want 4 bins:

2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0, 12.0, 14.0

I’ve ordered the data, but equal width binning doesn’t require it. First, the min and max values are found:

min = 2.0
max = 14.0

Next the width of each bucket is computed:

width = (max - min) / number-buckets
      = (14.0 - 2.0) / 4
      = 3.0

Next, the intervals are constructed:

[2.0, 5.0)  [5.0, 8.0)  [8.0, 11.0)  [11.0, 14.0)  -> data
  [0]  [1]    [2]  [3]    [4]   [5]     [6]   [7]  -> index
    {0}         {1}         {2}           {3}      -> bin

Next, the two end intervals are modified to catch outliers:

[-inf, 5.0)  [5.0, 8.0)  [8.0, 11.0)  [11.0, +inf)
  [0]  [1]    [2]  [3]    [4]   [5]     [6]   [7]
    {0}         {1}         {2}           {3}

Now you can bin. Suppose you have some value x = 9.5. It belongs in bin {2} so it maps to categorical value “2”.

In the approach above, most interval values are stored twice. This isn’t necessary but gives better clarity at the expense of a little bit of extra storage.

In C# code, without error-checking:

static double[] MakeIntervals(double[] data, int numBins)
{
  double max = data[0]; // find min & max
  double min = data[0];
  for (int i = 0; i < data.Length; ++i)
  {
    if (data[i] < min) min = data[i];
    if (data[i] > max) max = data[i];
  }
  double width = (max - min) / numBins; // compute width

  double[] intervals = new double[numBins * 2]; // intervals
  intervals[0] = min;
  intervals[1] = min + width;
  for (int i = 2; i < intervals.Length - 1; i += 2)
  {
    intervals[i] = intervals[i - 1];
    intervals[i + 1] = intervals[i] + width;
  }
  intervals[0] = double.MinValue; // outliers
  intervals[intervals.Length - 1] = double.MaxValue;

  return intervals;
}

And the partner method:

static int Bin(double x, double[] intervals)
{
  for (int i = 0; i < intervals.Length - 1; i += 2)
  {
    if (x >= intervals[i] && x < intervals[i + 1])
      return i / 2;
  }
  return -1; // error
}

Calling the methods could resemble:

double[] data = new double[] { 2.0, 3.0, . . , 14.0 };
int numBins = 4;
double[] intervals = MakeIntervals(data, numBins);

double x = 9.5;
int bin = Bin(x, intervals);
Console.WriteLine("\nBin of " + x + " is " + bin);

Instead of the static method approach, you could use an OOP approach, putting MakeIntervals code into a constructor and Bin code into a member method. Calling would look like:

double[] data = new double[] { 2.0, 3.0, . . , 14.0 };
int numBins = 4;
Binner b = new Binner(data, numBins);

double x = 9.5;
int bucket = b.Bin(x);
Console.WriteLine("\nBin of " + x + " is " + bucket);
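
A minimal sketch of such a Binner class (just the two static methods above moved into a class, with no error-checking) might look like:

public class Binner
{
  private double[] intervals;

  public Binner(double[] data, int numBins)
  {
    // same logic as MakeIntervals above
    double max = data[0];
    double min = data[0];
    for (int i = 0; i < data.Length; ++i)
    {
      if (data[i] < min) min = data[i];
      if (data[i] > max) max = data[i];
    }
    double width = (max - min) / numBins;

    this.intervals = new double[numBins * 2];
    intervals[0] = min;
    intervals[1] = min + width;
    for (int i = 2; i < intervals.Length - 1; i += 2)
    {
      intervals[i] = intervals[i - 1];
      intervals[i + 1] = intervals[i] + width;
    }
    intervals[0] = double.MinValue; // catch outliers
    intervals[intervals.Length - 1] = double.MaxValue;
  }

  public int Bin(double x)
  {
    // same logic as the static Bin method above
    for (int i = 0; i < intervals.Length - 1; i += 2)
    {
      if (x >= intervals[i] && x < intervals[i + 1])
        return i / 2;
    }
    return -1; // error
  }
}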

Binning code isn’t too difficult but, as much as any code I deal with, it has an absolutely huge number of implementation options.

Posted in Machine Learning