## Particle Swarm Optimization using C#

I wrote an article titled “Particle Swarm Optimization using C#” which appears in the November 2013 issue of Visual Studio Magazine. See http://visualstudiomagazine.com/articles/2013/11/01/particle-swarm-optimization.aspx.

Particle Swarm Optimization (PSO) is a technique based on group behavior such as bird flocking. PSO can be used to find an approximate solution to a numerical optimization problem in situations where classical techniques like those based on Calculus derivatives don’t work or aren’t feasible. Training a neural network is an example of such an optimization problem; the goal is to find the set of values for a neural network’s weights and biases so that the error between computed outputs and known outputs on a collection of training data is minimized.

In the machine learning community, by far the most common technique used to train a neural network is called back-propagation. However, I generally prefer to use PSO.

Because PSO is conceptually quite different from most traditional algorithms, the VSM article doesn't demonstrate using PSO to train a neural network; instead, it shows how to use PSO to solve a dummy benchmark problem: finding the values of x0 and x1 that minimize the function:

```
z = x0 * exp( -(x0^2 + x1^2) )
```

The graph of this function, and a screenshot of a demo program solving the minimization problem are shown in the images below. I intend to follow up the November article with an article that shows exactly how to use PSO to train a neural network.
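To give a concrete feel for the technique, here is a minimal, self-contained PSO sketch in C# for this benchmark function. This is not the article's demo code; the swarm size, iteration count, and the inertia and cognitive/social weights (0.729 and 1.49445) are commonly used defaults, not values taken from the article.

```csharp
using System;

class PsoSketch
{
  // the benchmark function to minimize
  static double Z(double x0, double x1)
  {
    return x0 * Math.Exp(-(x0 * x0 + x1 * x1));
  }

  static void Main()
  {
    Random rnd = new Random(0);
    int numParticles = 10;
    int maxIterations = 1000;
    double w = 0.729;    // inertia weight
    double c1 = 1.49445; // cognitive (particle) weight
    double c2 = 1.49445; // social (swarm) weight

    double[][] pos = new double[numParticles][];  // current positions
    double[][] vel = new double[numParticles][];  // current velocities
    double[][] best = new double[numParticles][]; // best position seen by each particle
    double[] bestZ = new double[numParticles];
    double[] swarmBest = new double[2];           // best position seen by any particle
    double swarmBestZ = double.MaxValue;

    for (int p = 0; p < numParticles; ++p)
    {
      pos[p] = new double[] { rnd.NextDouble() * 8 - 4,
        rnd.NextDouble() * 8 - 4 }; // random point in [-4, +4] x [-4, +4]
      vel[p] = new double[] { 0.0, 0.0 };
      best[p] = (double[])pos[p].Clone();
      bestZ[p] = Z(pos[p][0], pos[p][1]);
      if (bestZ[p] < swarmBestZ)
      {
        swarmBestZ = bestZ[p];
        swarmBest = (double[])pos[p].Clone();
      }
    }

    for (int iter = 0; iter < maxIterations; ++iter)
    {
      for (int p = 0; p < numParticles; ++p)
      {
        // move each particle toward its own best and the swarm's best
        for (int d = 0; d < 2; ++d)
        {
          vel[p][d] = w * vel[p][d] +
            c1 * rnd.NextDouble() * (best[p][d] - pos[p][d]) +
            c2 * rnd.NextDouble() * (swarmBest[d] - pos[p][d]);
          pos[p][d] += vel[p][d];
        }
        double z = Z(pos[p][0], pos[p][1]);
        if (z < bestZ[p])
        {
          bestZ[p] = z;
          best[p] = (double[])pos[p].Clone();
        }
        if (z < swarmBestZ)
        {
          swarmBestZ = z;
          swarmBest = (double[])pos[p].Clone();
        }
      }
    }

    Console.WriteLine("best x0 = {0:F4} x1 = {1:F4} z = {2:F4}",
      swarmBest[0], swarmBest[1], swarmBestZ);
  }
}
```

A run like this should land very close to the true minimum, z ≈ -0.4289 at (x0, x1) ≈ (-0.7071, 0), which you can verify with calculus.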

## Reading the MNIST Data Set with C#

The MNIST data set is a well-known collection of images of handwritten digits (0-9) that is used to benchmark machine learning pattern recognition algorithms. The MNIST data is stored in four binary files, which can be awkward to deal with directly, so I decided to write a C# program to access the data.

Each digit image is a 28 x 28 grid of pixels, where each pixel value is between 0 (white) and 255 (completely black); intermediate values are shades of gray.

There are 60,000 training digits and 10,000 test digits. Each of the two sets is stored in two binary files, one containing the pixel data and the other containing the corresponding label (0-9). The data files are available at http://yann.lecun.com/exdb/mnist/ in gzip form. I installed the free 7-Zip utility to unzip the files (I find WinZip increasingly annoying with their advertising).

The screenshot below shows a snapshot of the program reading the 10,000 test images. The program approximates each image by using a blank space for white, a period for gray, and an ‘O’ for black. The associated label is displayed below the image representation.

The program code is below. The main idea is to define a DigitImage class that has the pixels and the label of one digit. I open both files and read 28 x 28 bytes from the image file and one byte from the label file, and then combine them. Each file has some header ints (4 for the image data and 2 for the label data) that are read and discarded.

```
using System;
using System.IO;

namespace ReadMnistData // namespace name is mine; lost from the original listing
{
  class Program
  {
    static void Main(string[] args)
    {
      try
      {
        Console.WriteLine("\nBegin\n");

        FileStream ifsLabels =
          new FileStream(@"C:\t10k-labels.idx1-ubyte",
          FileMode.Open); // test labels
        FileStream ifsImages =
          new FileStream(@"C:\t10k-images.idx3-ubyte",
          FileMode.Open); // test images

        BinaryReader brLabels =
          new BinaryReader(ifsLabels);
        BinaryReader brImages =
          new BinaryReader(ifsImages);

        // read and discard the header ints
        // (4 in the image file, 2 in the label file)
        int magic1 = brImages.ReadInt32();
        int numImages = brImages.ReadInt32();
        int numRows = brImages.ReadInt32();
        int numCols = brImages.ReadInt32();

        int magic2 = brLabels.ReadInt32();
        int numLabels = brLabels.ReadInt32();

        byte[][] pixels = new byte[28][];
        for (int i = 0; i < pixels.Length; ++i)
          pixels[i] = new byte[28];

        // each test image
        for (int di = 0; di < 10000; ++di)
        {
          for (int i = 0; i < 28; ++i)
          {
            for (int j = 0; j < 28; ++j)
            {
              byte b = brImages.ReadByte();
              pixels[i][j] = b;
            }
          }

          byte lbl = brLabels.ReadByte();

          DigitImage dImage =
            new DigitImage(pixels, lbl);
          Console.WriteLine(dImage.ToString());
        } // each image

        brImages.Close();
        ifsImages.Close();
        brLabels.Close();
        ifsLabels.Close();

        Console.WriteLine("\nEnd\n");
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
      }
    } // Main
  } // Program

  public class DigitImage
  {
    public byte[][] pixels;
    public byte label;

    public DigitImage(byte[][] pixels,
      byte label)
    {
      this.pixels = new byte[28][];
      for (int i = 0; i < this.pixels.Length; ++i)
        this.pixels[i] = new byte[28];

      for (int i = 0; i < 28; ++i)
        for (int j = 0; j < 28; ++j)
          this.pixels[i][j] = pixels[i][j];

      this.label = label;
    }

    public override string ToString()
    {
      string s = "";
      for (int i = 0; i < 28; ++i)
      {
        for (int j = 0; j < 28; ++j)
        {
          if (this.pixels[i][j] == 0)
            s += " "; // white
          else if (this.pixels[i][j] == 255)
            s += "O"; // black
          else
            s += "."; // gray
        }
        s += "\n";
      }
      s += this.label.ToString();
      return s;
    } // ToString

  }
} // ns

```

## Five Microsoft Technology Software Developer Conferences in 2014

It’s surprisingly difficult to find a list of software developer conferences. In my case, I am most interested in conferences that target Microsoft technologies and are in the United States, especially the West Coast.

Currently, the five conferences that are most relevant to me are two conferences put on by Microsoft (Build and TechEd), and three conferences put on by non-Microsoft companies with Microsoft sponsorship (DevConnections, DevIntersection, Visual Studio Live). I’ve attended and spoken at all five of these events in the past, and in general, can recommend all of them.

As far as I can tell, here is how 2014 is shaping up. Often, a particular conference is offered twice per year, once on the West Coast and once on the East Coast. Because I live on the West Coast, here are the potential events for me:

1. Visual Studio Live, March 10-14, Las Vegas
2. Microsoft TechEd, May 12-15, Houston
3. Microsoft Build, unknown date, unknown city
4. DevConnections, September 15-19, Las Vegas
5. DevIntersection, November 9-12(?), Las Vegas

Visual Studio Live – This conference has been around for many years. VS Live tends to have a broad range of topics and targets a wide range of skill sets, but mostly intermediate-level developers I’d say. In 2014, VS Live will be in Las Vegas in March, Chicago in May, Redmond in August, and Orlando in November. Speakers come from both Microsoft and other companies. Visual Studio Live is put on by 1105 Media which does a lot of other kinds of conferences, and also publishes Visual Studio Magazine. Highly recommended for intermediate-level developers but beginners and advanced developers can find some interesting talks too.

Microsoft TechEd – In 2014 TechEd swallows the Microsoft Management Summit. In the past TechEd emphasized training for developers and IT people, and MMS emphasized training and products for IT people. Each year there was more overlap so combining the events makes sense. In 2014 TechEd will be in Houston – an unusual choice of venue. Highly recommended for IT pros and enterprise developers.

Microsoft Build – Build targets Web, system, desktop, mobile, and embedded software developers. Build is a combination of the old PDC (Professional Developer Conference) for traditional software developers and MIX (originally stood for “Meet, Interact, eXplore”) for Web developers. The dates and location of Build have not been announced but my wild guess is that Build will be in October in Las Vegas. Highly recommended for developers of all skill levels.

DevConnections – Another long-running conference that’s been around for at least 10 years. DevConnections has gone through some management changes recently, and 2013 was the first event put on by the new team. I wasn’t there but some of my friends say the event was similar to previous years, focusing on intermediate-level developers. DevConnections is a bit broader in scope than Visual Studio Live and targets developers and SQL people and IT people. The conference is run by Penton Media, a big company that does many events and publishes magazines including Windows IT Pro and SQL Server Pro. The 2014 event is scheduled for September 15-19 in Las Vegas. Recommended for developers with intermediate and beginning level skills.

DevIntersection – DevIntersection held its first event in 2012. DevIntersection is a spin-off from DevConnections; despite the similar names, the two events are no longer related organizationally, but they are similar in the sense that they both target a broad audience. The people who now run DevIntersection used to run DevConnections, and I always thought those events were very nice. The Spring 2014 event will be April 13-15 in Orlando, Florida (too far away for me). The Fall 2014 event dates and location have not been announced, but my guess is early November in Las Vegas. Highly recommended for developers with beginning and intermediate level skills.

There are many other conferences for software developers who use the MS technology stack, but I can recommend the five here on the basis of personal experience. All these conferences are a bit pricey (well, to me anyway). For example, the TechEd conference, not including hotel and travel, is about \$2000. The Visual Studio Live conference is about \$1600. It's a tough sell to get your company to foot the bill for one of these conferences, but maybe you can convince your management that the knowledge you'll gain, your improved morale, and the increased energy and productivity you'll have after returning are worth the price.

Posted in Machine Learning

## Getting Data into Memory with Excel Add-In Interop

To extend the functionality of Excel (for example, adding a machine learning operation such as data clustering), you can write an Excel add-in. The basic add-in typically handles the UI, but to do anything meaningful you usually need to use Excel Interop to read worksheet contents into memory and then, after doing some processing, write values in memory back to the worksheet. I described the add-in creation process in a previous blog post at http://jamesmccaffrey.wordpress.com/2013/07/08/analyzing-an-excel-2013-spreadsheet-programmatically-using-an-add-in/.

When I need to read the contents of a worksheet into memory, I typically use one of three approaches. One way is like so:

```
using Excel = Microsoft.Office.Interop.Excel;
using Tools = Microsoft.Office.Tools.Excel;
using Office = Microsoft.Office.Core;
. . .
Excel.Worksheet worksheet =
  Globals.ThisAddIn.Application.ActiveSheet
  as Excel.Worksheet; // the active sheet, in a VSTO add-in
Excel.Range usedRange = worksheet.UsedRange;
object[,] allData = usedRange.Value2;
// trim empty rows or columns
```

I use the UsedRange to get a reference to all cells that have values, and then the Value2 property to store those values into a two-dimensional matrix of type object. Unfortunately, the UsedRange property returns all cells that either currently have contents or had contents at some time in the past, so you might get many empty rows or columns. This requires some post-processing of the object[,] matrix in memory.

A second approach can be used when the add-in has some UI that allows users to specify the range of data. For example:

```
string upperLeftCell = "A1";  // placeholder; get from UI
string lowerRightCell = "C9"; // placeholder; get from UI
// or get one string in "A1:B2" form
Excel.Range specifiedRange =
  worksheet.get_Range(upperLeftCell, lowerRightCell);
object[,] someData = specifiedRange.Value2;
```

A third approach gets a user-selected (via the mouse) range of data. For example:

```
Excel.Range selectedRange =
  Globals.ThisAddIn.Application.Selection
  as Excel.Range; // whatever the user highlighted with the mouse
object[,] selectedData = selectedRange.Value2;
listBox1.Items.Add("First cell of user-selected data is " +
  selectedData[1, 1]);
```

In all cases, annoyingly, the object[,] matrix is 1-based rather than 0-based. This can easily lead to off-by-one bugs, so much so that in many cases I transfer the contents to a normal 0-based matrix.

```
Excel.Range usedRange = worksheet.UsedRange;

// get_Resize(1, 1) shrinks a range to just its upper-left cell
Excel.Range upperLeftRange = usedRange.get_Resize(1, 1);
```
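As a sketch of that 1-based to 0-based transfer (assuming, hypothetically, that every used cell holds a numeric value), something like this works:

```csharp
// copy the 1-based Interop matrix into an ordinary 0-based array
object[,] allData = usedRange.Value2;
int numRows = allData.GetLength(0);
int numCols = allData.GetLength(1);

double[][] data = new double[numRows][];
for (int i = 0; i < numRows; ++i)
{
  data[i] = new double[numCols];
  for (int j = 0; j < numCols; ++j)
    data[i][j] = Convert.ToDouble(allData[i + 1, j + 1]); // note the +1
}
```

After this, all further processing can use normal 0-based indexing and never touch the Interop matrix again.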

Working with Excel add-ins is interesting but not always intuitive.

Posted in Machine Learning, Software Test Automation

## Why You Should Use Cross-Entropy Error Instead Of Classification Error Or Mean Squared Error For Neural Network Classifier Training

When using a neural network to perform classification and prediction, it is usually better to use cross-entropy error than classification error, and somewhat better to use cross-entropy error than mean squared error to evaluate the quality of the neural network. Let me explain. The basic idea is simple but there are a lot of related issues that greatly confuse the main idea. First, let me make it clear that we are dealing only with a neural network that is used to classify data, such as predicting a person’s political party affiliation (democrat, republican, other) from independent data such as age, sex, annual income, and so on. We are not dealing with a neural network that does regression, where the value to be predicted is numeric, or a time series neural network, or any other kind of neural network.

Now suppose you have just three training data items. Your neural network uses softmax activation for the output neurons so that there are three output values that can be interpreted as probabilities. For example suppose the neural network’s computed outputs, and the target (aka desired) values are as follows:

```
computed       | targets              | correct?
-----------------------------------------------
0.3  0.3  0.4  | 0  0  1 (democrat)   | yes
0.3  0.4  0.3  | 0  1  0 (republican) | yes
0.1  0.2  0.7  | 1  0  0 (other)      | no
```

This neural network has classification error of 1/3 = 0.33, or equivalently a classification accuracy of 2/3 = 0.67. Notice that the NN just barely gets the first two training items correct and is way off on the third training item. But now consider the following neural network:

```
computed       | targets              | correct?
-----------------------------------------------
0.1  0.2  0.7  | 0  0  1 (democrat)   | yes
0.1  0.7  0.2  | 0  1  0 (republican) | yes
0.3  0.4  0.3  | 1  0  0 (other)      | no
```

This NN also has a classification error of 1/3 = 0.33. But this second NN is better than the first because it nails the first two training items and just barely misses the third training item. To summarize, classification error is a very crude measure of error.

Now consider cross-entropy error. The cross-entropy error for the first training item in the first neural network above is:

```
-( (ln(0.3)*0) + (ln(0.3)*0) + (ln(0.4)*1) ) = -ln(0.4)
```

Notice that in the case of neural network classification, the computation is a bit odd because all terms but one will go away. (There are several good explanations of how to compute cross-entropy on the Internet.) So, the average cross-entropy error (ACE) for the first neural network is computed as:

```
-(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38
```

The average cross-entropy error for the second neural network is:

```
-(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64
```

Notice that the average cross-entropy error for the second, superior neural network is smaller than the ACE error for the first neural network. The ln() function in cross-entropy takes into account the closeness of a prediction and is a more granular way to compute error.
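The arithmetic above is easy to verify with a few lines of C#. This is just a check of the numbers, not part of any neural network implementation:

```csharp
using System;

class AceCheck
{
  static void Main()
  {
    // per-item cross-entropy is -ln(probability assigned to the true class)
    double ace1 = -(Math.Log(0.4) + Math.Log(0.4) + Math.Log(0.1)) / 3.0;
    double ace2 = -(Math.Log(0.7) + Math.Log(0.7) + Math.Log(0.3)) / 3.0;
    Console.WriteLine("ACE first NN  = {0:F2}", ace1); // 1.38
    Console.WriteLine("ACE second NN = {0:F2}", ace2); // 0.64
  }
}
```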

By the way, you can also measure neural network quality by using mean squared error but this has problems too. The squared error term for the first item in the first neural network would be:

```
(0.3 - 0)^2 + (0.3 - 0)^2 + (0.4 - 1)^2 = 0.09 + 0.09 + 0.36 = 0.54
```

And so the mean squared error for the first neural network is:

```
(0.54 + 0.54 + 1.34) / 3 = 0.81
```

The mean squared error for the second, better, neural network is:

```
(0.14 + 0.14 + 0.74) / 3 = 0.34
```
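Again, the numbers are easy to check with a small C# snippet (a sketch for verification only; the helper name is mine):

```csharp
using System;

class MseCheck
{
  // squared error for one item: sum over outputs of (computed - target)^2
  static double SqErr(double[] computed, double[] targets)
  {
    double sum = 0.0;
    for (int i = 0; i < computed.Length; ++i)
      sum += (computed[i] - targets[i]) * (computed[i] - targets[i]);
    return sum;
  }

  static void Main()
  {
    double mse1 = (SqErr(new[] { 0.3, 0.3, 0.4 }, new[] { 0.0, 0.0, 1.0 })
      + SqErr(new[] { 0.3, 0.4, 0.3 }, new[] { 0.0, 1.0, 0.0 })
      + SqErr(new[] { 0.1, 0.2, 0.7 }, new[] { 1.0, 0.0, 0.0 })) / 3.0;
    double mse2 = (SqErr(new[] { 0.1, 0.2, 0.7 }, new[] { 0.0, 0.0, 1.0 })
      + SqErr(new[] { 0.1, 0.7, 0.2 }, new[] { 0.0, 1.0, 0.0 })
      + SqErr(new[] { 0.3, 0.4, 0.3 }, new[] { 1.0, 0.0, 0.0 })) / 3.0;
    Console.WriteLine("MSE first NN  = {0:F2}", mse1); // 0.81
    Console.WriteLine("MSE second NN = {0:F2}", mse2); // 0.34
  }
}
```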

MSE isn’t a hideously bad approach but if you think about how MSE is computed you’ll see that, compared to ACE, MSE gives too much emphasis to the incorrect outputs. It might also be possible to compute a modified MSE that uses only the values associated with the 1s in the target, but I have never seen that approach used or discussed.

So, I think this example explains why using cross-entropy error is clearly preferable to using classification error. Somewhat unfortunately there are some additional issues here. The discussion above refers to computing error during the training process. After training, to get an estimate of the effectiveness of the neural network, classification error is usually preferable to MSE or ACE. The idea is that classification error is ultimately what you’re interested in.

Suppose you are using back-propagation for training. The back-propagation algorithm computes gradient values which are derived from some implicit measure of error. Typically the implicit error is mean squared error, which gives a particular gradient equation that involves the calculus derivative of the softmax output activation function. But you can use implicit cross-entropy error instead of implicit mean squared error. This approach changes the back-propagation equation for the gradients. I have never seen research which directly addresses the question of whether to use cross-entropy error for both the implicit training measure of error and neural network quality evaluation, or to use cross-entropy just for quality evaluation. Such research may (and in fact probably does) exist, but I've been unable to track any papers down.

To summarize, for a neural network classifier, during training you can use mean squared error or average cross-entropy error, and average cross-entropy error is considered slightly better. If you are using back-propagation, the choice of MSE or ACE affects the computation of the gradient. After training, to estimate the effectiveness of the neural network it’s better to use classification error.

Posted in Machine Learning

## My Top 10 Favorite New Wave Songs of the 1980s

Most of my blog posts are purely technical but sometimes, just for fun, I’ll do a top-10 list. Like this.

The 1980s had some really interesting music. I distinctly remember the very first times I heard songs by Adam Ant, Flock of Seagulls, and other new wave bands. It’s kind of difficult to describe exactly what a new wave song is so I won’t try. Here are my top 10 favorite new wave songs. I don’t mean these are the best songs, I mean that if I was going on a road trip and could only take ten new wave songs, these would be the ones.

1. “I Ran”, A Flock of Seagulls (1982). This song is almost a cliche for 1980s new wave excess and bad hair, but it’s still an excellent, catchy song. Great use of synthesizer. http://www.youtube.com/watch?v=BJ7NVjZ-Eyg

2. “Do You Wanna Hold Me”, Bow Wow Wow (1983). One of my top 10 songs of all time. Combination of great guitar work and incredible vocals by Annabella Lwin. A.L. was really beautiful and I think she was only 16 or 17 when she sang this song. http://www.youtube.com/watch?v=l7BwRL2yhGQ

3. “Public Image”, Public Image Ltd (1979). This is really a post-punk, pre-new-wave song but this is my list so I can do what I want. Really a simple song but John Lydon (aka Johnny Rotten when he was in the Sex Pistols) has a really interesting, distinctive voice. http://www.youtube.com/watch?v=ylOCIP54PIQ

4. “Invisible Sun”, The Police (1981). The Police are probably on most people’s top-10 songs of the 80s list, but usually not for this song. It’s always been my favorite from this band. http://www.youtube.com/watch?v=NIylUcGDi-Y

5. “The Cutter”, Echo and the Bunnymen (1983). This is a very complex song musically that has held up well over time. http://www.youtube.com/watch?v=VM6j14DDtGI

6. “Whisper to a Scream (Birds Fly)”, Icicle Works (1983). Not really sure why I like this song so much but it’d be on my road-trip list anytime. Nice balance of vocals, drums, guitar and synthesizer. http://www.youtube.com/watch?v=NVQCpI4GbKQ

7. “Major Tom”, Peter Schilling (1983). I’m almost embarrassed to put this song on my list because it’s kind of silly but I like it. On some other day I’d substitute one of the honorable mention songs listed below. Totally weird lyrics/subject (even if you know the reference to Bowie’s “Space Oddity”) but a very catchy melody. http://www.youtube.com/watch?v=N1Hs2AQwDgA

8. “Roam”, The B-52s (1989). Kate Pierson and Cindy Wilson have two of the best voices of the 80s and together they’re fantastic. I pick “Roam” over some other B-52s great songs, “Rock Lobster” in particular. http://www.youtube.com/watch?v=IWEfmCvu8R8

9. “Church of the Poison Mind”, Culture Club (1983). I’m not really a big Culture Club fan but I really think this song complements the others on my list nicely. Boy George was too weird and creepy but he could sure sing. http://www.youtube.com/watch?v=HVzAH0FtNwg

10. “The Promise”, When in Rome (1988). I normally don’t like relatively slow songs but the addictive melody of this song puts it on my top-10 list. http://www.youtube.com/watch?v=5HI_xFQWiYU

Notes:

“Take on Me” (A-Ha) is a good song that I remember for the very clever video. There are a zillion songs that remind me of 1980s movies like “Sixteen Candles” and “Breakfast Club”: “Don’t You Forget About Me” (Simple Minds), “True” (Spandau Ballet), and on and on, but they don’t make it on my road-trip list. On some days I’d put “Lies” or “Hold Me Now” by the Thompson Twins on my list. I’m a bit surprised that no Duran Duran songs quite made my list. I like “We Got the Beat” (The Go-Go’s) but it just missed the cut. “Dancing with Myself” (Billy Idol) is almost a good song but too primitive musically. Two of my favorite slow songs of the 1980s are “Wishful Thinking” (China Crisis) http://www.youtube.com/watch?v=oj20LKdg8-8 and “Dance Away” (Roxy Music) http://www.youtube.com/watch?v=7lLcZPhTvFE but I don’t like slow for road trips. “One Way or Another” (Blondie) is good but not top-10 for me.

Posted in Uncategorized

## K-Fold Cross-Validation for Neural Networks

I wrote an article “Understanding and Using K-Fold Cross-Validation for Neural Networks” that appears in the October 2013 issue of Visual Studio Magazine. See http://visualstudiomagazine.com/articles/2013/10/01/understanding-and-using-kfold.aspx. Exactly what k-fold cross-validation is, and why it is used, are somewhat difficult to explain clearly. Let me try. The main technical challenge when working with a neural network is training the network, which means finding values for the network’s many weights and biases so that for a given set of input values, the network’s computed output values closely match known outputs of a set of training data.

Because neural networks are universal function approximators, given enough time, it is always possible (in theory) to find a set of weights and biases so that computed outputs exactly match training data outputs. But if you use those weights and bias values on new, previously unseen data, your neural network will predict very poorly. This is called over-fitting. (I breezed through this but over-fitting is a very deep concept).

OK, so the problem is over-fitting. There are many ways to deal with over-fitting. K-fold cross-validation is one. The idea is to break the training data into k subsets, where k is usually 10. Then you run your training algorithm (the three most common approaches are back-propagation, particle swarm optimization, and genetic algorithm optimization) 10 times. On the first training run you use 9/10 of the training data to train, and then compute the network's accuracy using the remaining 1/10 of the data. This process is repeated so that each 1/10 subset is used exactly once as the validation set. When finished, you take the average of the 10 accuracies and use it as the overall estimate of the accuracy of the network. In short, k-fold cross-validation gives you an estimate of a neural network's accuracy when the network was constructed using particular values for the number of hidden nodes and the training parameters.

How does this help? Well, if you do k-fold cross-validation repeatedly, and during the training phase use different values for the training technique's parameters (different techniques have different parameters – back-prop needs learning rate and momentum; particle swarm needs inertia, cognitive and social weights; and so on) and also try different numbers of hidden nodes, you can find the best values for the number of hidden nodes and the training parameters. Then, with these in hand, you can finally train your network using all your data, with the best number of hidden nodes and training parameters.

In pseudo-code:

```
loop "many" times
  pick a number of hidden nodes
  pick training parameters (learning rate, etc.)

  // k-fold
  divide train data into 10 parts
  for i = 1 to 10
    train network using 9 parts
    compute accuracy using 1 part
  end for
  compute average accuracy of the 10 runs

  if avg accuracy best found so far
    save number hidden nodes used
    save training parameters used
    save best average accuracy value
  end if
end loop

train network using all data
(using best number hidden nodes,
 and best training parameters)
estimated accuracy is best accuracy found above
```
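The "divide train data into 10 parts" step is usually done by assigning each training item a fold index. Here is a hedged sketch in C# of one way to do that; the names and details are mine, not from the article:

```csharp
using System;

class FoldDemo
{
  // assign each of n training items to one of k folds, round-robin
  // after a random shuffle, so every item is a validation item exactly once
  static int[] MakeFolds(int n, int k, Random rnd)
  {
    int[] indices = new int[n];
    for (int i = 0; i < n; ++i) indices[i] = i;
    for (int i = 0; i < n; ++i) // Fisher-Yates shuffle
    {
      int r = rnd.Next(i, n);
      int tmp = indices[r]; indices[r] = indices[i]; indices[i] = tmp;
    }
    int[] folds = new int[n];
    for (int i = 0; i < n; ++i)
      folds[indices[i]] = i % k; // folds end up nearly equal in size
    return folds;
  }

  static void Main()
  {
    int[] folds = MakeFolds(25, 10, new Random(0));
    // on run f, items where folds[i] == f are the validation set
    // and all other items are the training set
    for (int f = 0; f < 10; ++f)
    {
      Console.Write("fold " + f + " validation items: ");
      for (int i = 0; i < folds.Length; ++i)
        if (folds[i] == f) Console.Write(i + " ");
      Console.WriteLine();
    }
  }
}
```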

Anyway, k-fold cross-validation was difficult for me to grasp because there are so many inter-related issues, but after I thought the process over enough times, it finally made sense.

Posted in Machine Learning