Two Ways to Deal with the Derivative of the ReLU Function

I gave a talk about the back-propagation algorithm recently. Back-propagation is used to train a neural network. Consider a math equation like y = 5X1 + 7X2. The equation has two inputs, X1 and X2, and two constants, 5 and 7, that determine the output. If you think of a NN as a very complex math equation, the weights of the NN are the constants. Training a NN is the process of using data with known correct input and output values to find the values of the weights. And back-prop is the most common algorithm used for training.

A NN uses one or more internal activation functions. One common activation function is the logistic sigmoid, logsig(x) = 1.0 / (1.0 + e^-x). Back-propagation requires the Calculus derivative of the activation function. If y = logsig(x), then the Calculus derivative is y’ = e^-x / (1.0 + e^-x)^2 and by a very cool, non-obvious algebra coincidence y’ = y * (1 – y).
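To see that the two forms of the derivative really do agree, here is a minimal Python sketch. The function names are my own (hypothetical, not from any library), and it only checks a few sample values of x.

```python
import math

def logsig(x):
    # logistic sigmoid: 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def logsig_deriv_direct(x):
    # derivative written in terms of x: e^-x / (1 + e^-x)^2
    e = math.exp(-x)
    return e / (1.0 + e) ** 2

def logsig_deriv_from_output(y):
    # derivative written in terms of the output y = logsig(x)
    return y * (1.0 - y)

for x in (-2.0, 0.0, 0.5, 3.0):
    y = logsig(x)
    print(x, logsig_deriv_direct(x), logsig_deriv_from_output(y))
```

The y * (1 - y) form is handy in back-propagation because the node output y has already been computed during the forward pass.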

But for deep neural networks, a common activation function is ReLU(x) = max(0, x). If you graph y = ReLU(x) you can see that the function is mostly differentiable. If x is greater than 0 the derivative is 1 and if x is less than zero the derivative is 0. But when x = 0, the derivative does not exist.

There are two ways to deal with this. First, you can just arbitrarily assign a value for the derivative of y = ReLU(x) when x = 0. Common arbitrary values are 0, 0.5, and 1. Easy!
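The arbitrary-value approach can be sketched in a few lines of Python. The function names and the at_zero parameter are hypothetical, chosen just for this illustration.

```python
def relu(x):
    # ReLU(x) = max(0, x)
    return max(0.0, x)

def relu_deriv(x, at_zero=0.0):
    # derivative is 1 for x > 0 and 0 for x < 0;
    # at x = 0 the true derivative doesn't exist, so return an assigned value
    if x > 0.0:
        return 1.0
    if x < 0.0:
        return 0.0
    return at_zero

print(relu_deriv(2.5))                # x > 0
print(relu_deriv(-1.0))               # x < 0
print(relu_deriv(0.0, at_zero=0.5))   # arbitrary value at x = 0
```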

A second alternative is to replace the actual y = ReLU(x) function with an approximation that is differentiable for all values of x. One such approximation is called softplus, defined as y = ln(1.0 + e^x). Its derivative is y’ = 1.0 / (1.0 + e^-x) which is, remarkably, the logistic sigmoid function. Neat!
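A quick way to convince yourself of the softplus-to-sigmoid relationship is to compare a numerical derivative of softplus against the logistic sigmoid. This is a rough sketch with hypothetical helper names; the central-difference step size h is an arbitrary choice.

```python
import math

def softplus(x):
    # smooth approximation to ReLU: ln(1 + e^x)
    return math.log(1.0 + math.exp(x))

def logsig(x):
    # logistic sigmoid: 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def numeric_deriv(f, x, h=1e-6):
    # central-difference estimate of f'(x)
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, 0.0, 2.0):
    print(x, numeric_deriv(softplus, x), logsig(x))
```

The two printed columns should agree to several decimal places, including at x = 0 where ReLU itself has no derivative.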

When I implement a deep NN from scratch, I usually use the arbitrary-value-when-x-equals-zero approach. I have never seen any research that looks at which of the two ways to deal with y = ReLU(x) being non-differentiable at 0 is better.

Posted in Machine Learning | Leave a comment

I Give a Talk “Introduction to Deep Neural Networks”

I gave a talk titled “Introduction to Deep Neural Networks” recently. The goal was to give the audience of engineers the information they needed to understand what types of problems can be solved using a DNN, and what tools and libraries they can use to implement a DNN.

The term deep neural network has multiple meanings. DNN can refer to a specific type of neural network that is the same as a simple NN except the DNN has two or more hidden layers. DNN can also refer to one of many, more exotic forms of neural networks that have multiple hidden layers.

I described ordinary DNNs for classification and numeric prediction, convolutional NNs for image recognition, simple recurrent NNs (mostly of historical interest), and long short-term memory (LSTM) networks for natural language processing. I also speculated a bit about generative adversarial networks, and quantum computing.

In terms of tools and libraries, I explained how there are many alternatives for non-deep NNs, but for DNNs, the only Microsoft approach I was aware of was the CNTK library (well, other than coding from scratch, which is very difficult).

Moral of the story: Maybe eight years ago, knowledge of simple neural networks wasn’t needed in many developer situations. Now that knowledge is almost essential. And knowledge of DNNs is quickly becoming a critically important skill for many developer scenarios.

Posted in Conferences, Machine Learning | 2 Comments

Neural Network Glorot Initialization

You’d think that initializing the weights and biases in a neural network wouldn’t be very difficult or interesting. Not so.

The simplest way to initialize weights and biases is to set them to small (perhaps -0.01 to +0.01) uniform random values. And this works well for NNs with a single hidden layer. But a simple approach doesn’t always work well with deep NNs, especially those that use ReLU (rectified linear unit) activation.

One common initialization scheme for deep NNs is called Glorot (also known as Xavier) Initialization. The idea is to initialize each weight with a small Gaussian value with mean = 0.0 and variance based on the fan-in and fan-out of the weight.

For example, each weight that connects an input node to a hidden node has fan-in of the number of input nodes and fan-out of the number of hidden nodes. In pseudo-code the initialization is:

for-each input-hidden weight
  variance = 2.0 / (fan-in + fan-out)
  stddev = sqrt(variance)
  weight = gaussian(mean=0.0, stddev)

Instead of using variance = 2.0 / (fan-in + fan-out) with a Gaussian distribution, you can also use a Uniform distribution between [-sqrt(6) / sqrt(fan-in + fan-out), sqrt(6) / sqrt(fan-in + fan-out)]. Therefore, the term “Glorot Initialization” is ambiguous because it can refer to two somewhat different algorithms.
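Both variants can be sketched with NumPy. This is just an illustration, not code from any particular library: the function names, the (fan_in, fan_out) matrix layout, and the seed value are my own assumptions.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    # Gaussian variant: mean 0, variance 2 / (fan_in + fan_out)
    stddev = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(loc=0.0, scale=stddev, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    # Uniform variant: values in [-limit, +limit], limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(low=-limit, high=limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)         # arbitrary seed for reproducibility
w_normal = glorot_normal(4, 5, rng)    # 4 input nodes, 5 hidden nodes
w_uniform = glorot_uniform(4, 5, rng)
print(w_normal)
print(w_uniform)
```

A uniform distribution on [-a, +a] has variance a^2 / 3, so with a = sqrt(6 / (fan-in + fan-out)) the uniform variant has the same variance, 2.0 / (fan-in + fan-out), as the Gaussian variant.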

If you want to read the original research paper, do a Web search for “Understanding the Difficulty of Training Deep Feedforward Neural Networks”.

Posted in Machine Learning | Leave a comment

Neural Network Back-Propagation using Python

I wrote an article titled “Neural Network Back-Propagation using Python” in the June 2017 issue of Visual Studio Magazine.

I strongly believe that when working with machine learning, even if you’re using a tool such as Weka or a library such as TensorFlow, it’s important to understand what is going on behind the scenes. And for me, the best way to understand a ML topic is by coding the topic from scratch.

Additionally, coding a ML system from scratch gives you complete control over the system, and allows you to customize the code and to experiment. Neural network back-propagation is an example of a topic that requires code for complete understanding (for me anyway).

I coded a demo program in Python plus the NumPy numeric add-on package. Why? Because Python plus NumPy has become the de facto standard interface for leading deep learning libraries, notably Google TensorFlow and Microsoft CNTK. So a side-benefit of my article demo code is that you gain useful Python skills.

On the one hand, the ideas behind neural network back-propagation are not overwhelmingly difficult (even though they’re by no means easy). However, when you code back-propagation, a ton of important details are revealed.

The back-prop weight update code depends on the underlying error function assumption. If you assume mean squared error, then there are several equivalent forms. One is “squared computed output minus target” and another is “squared target minus computed output”. Both forms give the same error value, but lead to different update code.

Suppose, for a given training data item, the target vector is (0, 1, 0) and the computed outputs are (0.20, 0.70, 0.10). Using “squared target minus output” the error is (0 – 0.20)^2 + (1 – 0.70)^2 + (0 – 0.10)^2 = 0.04 + 0.09 + 0.01 = 0.14. Using the “squared output minus target” the error is (0.20 – 0)^2 + (0.70 – 1)^2 + (0.10 – 0)^2 = 0.14 again.

But the back-prop update code depends on the Calculus derivative of the error function. Here the target is a constant but the computed output is variable. The net result is that one form of error leads you to add a weight delta, and the other form of error leads you to subtract a weight delta.
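Here is a small sketch of the add-versus-subtract idea. It is not the code from the article: it assumes a hypothetical single linear node with one weight (output = w * x), folds the constant 2 from the derivative into the learning rate, and uses made-up function names.

```python
def squared_error(a, b):
    # sum of squared differences; the order of a and b doesn't change the value
    return sum((p - q) ** 2 for p, q in zip(a, b))

targets = [0.0, 1.0, 0.0]
outputs = [0.20, 0.70, 0.10]
print(squared_error(targets, outputs))  # target minus output, about 0.14
print(squared_error(outputs, targets))  # output minus target, same value

def update_target_minus_output(w, x, t, lr):
    # signal coded as (t - o): the delta is ADDED to the weight
    o = w * x
    delta = lr * (t - o) * x
    return w + delta

def update_output_minus_target(w, x, t, lr):
    # signal coded as (o - t): the delta is SUBTRACTED from the weight
    o = w * x
    delta = lr * (o - t) * x
    return w - delta

print(update_target_minus_output(0.5, 2.0, 3.0, 0.1))
print(update_output_minus_target(0.5, 2.0, 3.0, 0.1))
```

Both update functions produce the same new weight. Mixing them up (for example, computing output minus target and then adding the delta) moves the weight in the wrong direction, so the error grows instead of shrinking.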

The moral is that to completely understand neural network back-propagation, it’s a good idea to look at an actual code implementation.

Posted in Machine Learning | 1 Comment

GeekWire and Microsoft Talk Sports Technology

GeekWire is a Seattle-based Web site that is quite well-known in the Pacific Northwest. The site posts all kinds of interesting, tech-related stories. I had read a few sports-tech related articles written by a reporter, Taylor Soper. So, last week Tuesday, I cold-called Taylor and asked him if he’d be willing to come to Microsoft Research in Redmond and give a talk about what he has seen recently.

Taylor returned my call a few minutes later, and about five minutes after that, agreed to speak on Thursday. I really like dealing with people who make quick decisions, as opposed to people who’ll plan to have a meeting to plan another meeting to create a plan on what to do. (If you think I’m exaggerating, try working for a huge company sometime).

Anyway, Taylor came out to Microsoft on Thursday morning. My work buddy Bryan and I acted as hosts and introduced Taylor. His presentation was titled “The State of Sports Tech 2017” and it was excellent. To be honest, I always get a bit nervous when sponsoring a speaker I haven’t heard talk before — it’s very difficult to deliver a talk that’s both informative and interesting, but Taylor did both (to my relief).

I mention this because some of the absolute worst talks I’ve heard have been given by researchers to a non-research audience. One of the keys to a good talk is understanding exactly who your audience is. I know of one research department in a very large tech company, where the researchers were essentially black-balled from being selected to speak at their own company’s events with a non-research audience because previous talks by the researchers had been so incredibly bad (meaning the talks would have been perfect for fellow researchers, but were terrible for non-researchers).

Moving on. Taylor talked about all kinds of interesting things he’s seen lately. I was particularly interested in some of the work that’s going on related to what Taylor called “the fan experience”. Imagine watching a basketball game from an NBA basketball player’s point of view. Imagine watching an NFL football game or a soccer game where all kinds of statistical information about a player is displayed in real-time, next to the player.

Interestingly, I felt that some of the potential technologies might have a negative impact. For example, it’s technically possible to replace a baseball home plate umpire with a computer system that calls balls and strikes. Ugh! I’d hate that. I like the human-ness of sports and human error is an important and fascinating part of any sport, in my mind anyway.

Taylor wrapped up his talk by briefly describing an upcoming event sponsored by GeekWire, the GeekWire Sports Tech Summit. The event is June 21-22, 2017 and I wish I could go but I made a speaking commitment elsewhere. Dang. But I’ll try to attend next year.

If you’re reading this post on or before June 20 and you want to attend the 2017 Sports Tech Summit, contact Taylor. I’ll bet he could wrangle a nice discount off the regular price for you.

Posted in Conferences, Miscellaneous | Leave a comment

Graphing Rastrigin’s Function using the Matplotlib Library

I thought I’d refresh my memory of the matplotlib library, an add-on package for Python that can create plots and graphs. I regularly use the plotting functions in R, SciLab, Excel, and Python, but using these isn’t easy or intuitive so I like to stay in practice.

One of my standard practice problems is to create a 3D graph of Rastrigin’s function. Rastrigin’s function is a standard benchmark optimization problem because it has many false local minima, but only one true global minimum at (0, 0).
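For reference, the two-variable form of the function can be written as a plain Python function. This is a minimal sketch and the function name is my own.

```python
import math

def rastrigin(x, y):
    # two-variable Rastrigin: sum of (v^2 - 10*cos(2*pi*v)) over x and y, plus 20
    return (x**2 - 10.0 * math.cos(2.0 * math.pi * x)) + \
           (y**2 - 10.0 * math.cos(2.0 * math.pi * y)) + 20.0

print(rastrigin(0.0, 0.0))  # the global minimum
print(rastrigin(1.0, 1.0))  # close to one of the false local minima
print(rastrigin(0.5, 0.5))  # a peak between minima
```

The value at (0, 0) is 0, while points near the other integer coordinates sit in shallow dips with small positive values, which is what traps many optimization algorithms.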

Here’s one way to graph Rastrigin’s function using matplotlib:


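from matplotlib import cm  # color maps
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib
import matplotlib.pyplot as plt
import numpy as np

# 200 x 200 grid of points covering [-4, 4] on each axis
X = np.linspace(-4, 4, 200)
Y = np.linspace(-4, 4, 200)
X, Y = np.meshgrid(X, Y)

# Rastrigin's function for two variables
Z = (X**2 - 10 * np.cos(2 * np.pi * X)) + \
  (Y**2 - 10 * np.cos(2 * np.pi * Y)) + 20

fig = plt.figure()
# fig.gca(projection='3d') no longer works on newer matplotlib releases
ax = fig.add_subplot(projection='3d')
surf = ax.plot_surface(X, Y, Z, rstride=1,
  cstride=1, cmap=cm.jet)
# plt.savefig('rastrigin.png')
plt.show()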

I don’t enjoy creating graphs, but in many situations graphs are extremely useful to help explain a machine learning example.

Posted in CNTK | Leave a comment

The Greatest Chess Tournament of All Time

I haven’t played an actual game of chess in many years, but I used to love to play chess in high school, when I had a lot of time. I still follow chess, mostly via the excellent

In June of 2017, an incredible chess tournament was held. The “Altibox Norway Chess Tournament” may well be the greatest chess tournament in history. The tournament had 10 players — and they were the top 10 ranked players in the world, by FIDE chess rating. The field included the current world champion, Magnus Carlsen (Norway, #1) and the previous two world champions, Vishy Anand (India, #7) and Vladimir Kramnik (Russia, #4). Also competing was the most recent championship challenger, Sergey Karjakin (Russia, #9).

The other six players were Wesley So (USA, #2), Fabiano Caruana (USA, #3), Maxime Vachier-Lagrave (France, #5), Hikaru Nakamura (USA, #6), Levon Aronian (Armenia, #8), and Anish Giri (Netherlands, #10).

In the history of chess, there are only one or two other tournaments that might be considered as great or greater. The AVRO 1938 tournament had eight players, including four world champions, plus four other players. The eight players were widely acknowledged as the strongest in the world (there were no chess ratings in 1938). AVRO 1938 competitors were Alexander Alekhine (born in Russia, then champion), Jose Raul Capablanca (Cuba, former champion), Max Euwe (Netherlands, former champion), Mikhail Botvinnik (Soviet Union, future champion), Paul Keres (Estonia), Reuben Fine (USA), Samuel Reshevsky (USA), and Salo Flohr (Czechoslovakia). Keres and Fine tied for first place.

There have been many “great” chess tournaments, but “great” is subjective and hard to define. I don’t necessarily equate “greatest” with “strongest”. Some of the tournaments on my personal list of great tournaments include Hastings 1895, St. Petersburg 1914, New York 1924, Santa Monica 1966, and Las Palmas 1996.

Anyway, the somewhat-of-a-surprise winner of Altibox 2017 was Levon Aronian, clear first with 6.0 (out of 9) points. Nakamura and Kramnik tied for 2nd and 3rd (5.0 points). Caruana, So, and Giri tied for 4th through 6th places (4.5 points). Vachier-Lagrave, Anand, and Carlsen tied for 7th through 9th (4.0 points). Karjakin finished 10th with 3.5 points.

The poor result of the current champion, Carlsen, was surprising and disappointing. There was immediate speculation that Carlsen has lost his burning desire to win. Well, only time will tell. I wish Carlsen had won, because having a world champion win a tournament adds to its “greatest”-ness in my opinion.

If two or more of the Altibox competitors eventually go on to win a world championship over the next 10 or so years, giving Altibox 2017 a total of five champions, then I’d rate Altibox as the greatest tournament ever. But until then, I’ll give the title to AVRO 1938.

Aronian and Carlsen:

Caruana and Kramnik:

Giri and Anand:

Nakamura and Vachier-Lagrave:

So and Karjakin:

Posted in Miscellaneous, Top Ten | 2 Comments