I Gave a Talk About Neural Network Error and Accuracy

For most of the software developers I know, there’s tremendous interest in machine learning and neural networks. So I’ve been giving a series of one-hour talks aimed at developers and designed to get them up to speed quickly.

A few days ago I talked about error and accuracy. My first two talks explained how the neural network input-output process works, and then how the back-propagation training algorithm works. In both of those talks, I assumed the underlying Error function that compares computed output values, such as (0.20, 0.70, 0.10), with correct target values, such as (0, 1, 0), is the simple squared error function. For the data just mentioned, squared error would be (0.20 – 0)^2 + (0.70 – 1)^2 + (0.10 – 0)^2 = 0.04 + 0.09 + 0.01 = 0.14.
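As a quick sketch, that squared error computation can be written out in a few lines of Python (the variable names are mine, just for illustration):

```python
# Squared error for one training item: the sum of squared differences
# between computed output values and correct target values.
computed = [0.20, 0.70, 0.10]
targets = [0, 1, 0]

squared_error = sum((o - t) ** 2 for o, t in zip(computed, targets))
print(round(squared_error, 2))  # 0.14
```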

The next logical step in a developer’s understanding of neural networks is cross entropy error. Cross entropy error isn’t as obvious as squared error. For the data above, cross entropy error is -1 * [log(0.20)*0 + log(0.70)*1 + log(0.10)*0] = 0.3567, where log is the natural logarithm.
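The same computation as a sketch in Python. Notice that because the target values are 0 except for a single 1, only one log term actually contributes:

```python
import math

# Cross entropy error: -sum over classes of target * ln(computed).
# Only the term where the target value is 1 survives.
computed = [0.20, 0.70, 0.10]
targets = [0, 1, 0]

cross_entropy = -sum(t * math.log(o) for o, t in zip(computed, targets))
print(round(cross_entropy, 4))  # 0.3567
```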

As it turns out, using cross entropy error rather than squared error is often (but not always) better because it tends to produce more accurate predictions. The reasons why are rather subtle.

Briefly, when using squared error, the weight-update computation during training contains a term (1 – output)*(output). Because “output” is a probability between 0 and 1, the term is always between 0 and 0.25 (the maximum occurs when output = 0.5). For example, if output = 0.60 then the term is 0.60 * 0.40 = 0.24. This small multiplier makes training a bit slower. But if you use cross entropy error, the term cancels out of the update, so training is faster. Sort of. I’ve skipped over a ton of important details.
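A simplified sketch of that effect, for a single sigmoid output node (function names are mine, and this glosses over the same details the talk does):

```python
# With squared error, the output-node update signal carries the extra
# output * (1 - output) factor; with cross entropy error that factor
# cancels out of the math, leaving just (output - target).
def squared_error_signal(output, target):
    return (output - target) * output * (1.0 - output)

def cross_entropy_signal(output, target):
    return output - target

o, t = 0.60, 1.0
# squared error: -0.40 * 0.24 = -0.096 (dampened, so slower learning)
print(round(squared_error_signal(o, t), 3))  # -0.096
# cross entropy: -0.40 (undampened)
print(round(cross_entropy_signal(o, t), 3))  # -0.4
```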

Now accuracy is just the percentage of correct predictions. Ultimately, prediction accuracy is the metric you’re really interested in, but during neural network training, accuracy is too crude a measure to guide the weight updates.
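For completeness, a minimal sketch of computing accuracy for the kind of output/target vectors used above (the data here is made up for illustration):

```python
# Accuracy: the fraction of items where the index of the largest computed
# output matches the index of the 1 in the target vector.
def is_correct(computed, target):
    return computed.index(max(computed)) == target.index(max(target))

all_outputs = [[0.20, 0.70, 0.10], [0.60, 0.30, 0.10], [0.10, 0.10, 0.80]]
all_targets = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]

num_correct = sum(is_correct(o, t) for o, t in zip(all_outputs, all_targets))
accuracy = num_correct / len(all_outputs)
print(round(accuracy, 4))  # 2 of 3 correct = 0.6667
```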

The moral of the story is that anyone can become an expert on neural networks. But there are a lot of details that need to be learned one at a time. I estimate that in 16 one-hour talks, a developer can become an expert.

This entry was posted in Machine Learning.

6 Responses to I Gave a Talk About Neural Network Error and Accuracy

  1. BoilingCoder says:

    Interesting observation you made, how about this solution :
    Double speedLearnRate = 0.01; // so the factor decays to 1 after 100 epochs
    Double speedLearnFactor = Math.Max(1, 2 - epoch * speedLearnRate); // linear decay from 2 to 1
    SquaredError = SquaredError * speedLearnFactor;

    well I’m going to test it on the iris flowers 😉

  2. BoilingCoder says:

    I set speedLearnRate to 0.02 when used with max epochs 200, during a sweep training.
    Then I got the same score on the iris flowers, but when using 0.01 or 0.006
    I got more networks that were less accurate (98.75% vs. 99.16%).

  3. Is it possible to see these one-hour talks anywhere?

    • Right now the videos are only viewable on the internal Microsoft corporate network. I’ve been unable to get the powers-that-be to allow the videos to be viewed outside of Microsoft, but I’m working on a way to record them again, this time for viewing on YouTube or Channel-9. JM

      • That is a shame… I find your blog posts extremely informative and helpful. I have read all your posts on NN and GA since 2013, but only recently started writing my own NN library. I knew from the beginning that I would need a serious, “configurable” C# library in order to do some of the things I wanted to do. Hence, my current library allows me to dynamically configure layer counts, neuron counts, activation functions, and output error functions. It has been extremely helpful in my pursuit of customizable functionality.

        My current library (written in C#) took all your NN techniques and put them into one library. I focus on using “Lists” instead of hard-coded arrays, and I use LINQ to handle updates and queries… Additionally, my back propagation uses “reflection” to point to specific functions, i.e., when I use Sigmoid as the activation function, Back-Prop will use SigmoidDerivative and SigmoidGradient respectively. This is also done for HyperTangent, ArcTangent, and SoftMax, with Cross Entropy handled as well.

        I’m really liking what I wrote so far, but now I want to implement dynamic optimization. That said, I need a better understanding of EO and Particle Swarm. I have the GA written per one of your blogs, and now I would like to implement this into the NN library as an option. I would like an opportunity to connect with you for some basic “academic” type questions on EO and Particle Swarm. This is why I was looking for your videos…

        I would like an opportunity to collaborate on this if possible, and as such, I would be willing to share my current library with you for your evaluation, and use if you so choose. The library works great with all of your examples, is very efficient, and can be dynamically changed per your needs… A deep NN is simply a matter of changing a configuration element.

        Thank you for your blog posts and information,

        Stew Basterash.
