Mini-Batch Neural Network Training

I wrote an article titled “Variation on Back-Propagation: Mini-Batch Neural Network Training” in the July 2015 issue of Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2015/07/01/variation-on-back-propagation.aspx.

A neural network is a complicated math function that has many constant values called weights that, along with the input values, determine the output values. Training a neural network is the process of finding values for the weights so that the computed output values closely match known correct output values. This is accomplished by using a set of training data that has known input values and known, correct output values.

There are many algorithms to train a neural network. By far the most common is the back-propagation algorithm. Back-propagation works by calculating a set of values called the gradients. Gradients are calculus derivatives of the error with respect to each weight. They indicate how to adjust the current set of weight values so that when the NN is fed the training data input values, the calculated output values get closer to the known correct output values. There is one gradient value for each NN weight.
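To make the idea of a gradient concrete, here is a minimal sketch of one gradient-based adjustment. It assumes a toy model with a single weight, a squared-error measure, and a learning rate of 0.1. None of these come from the article; they are chosen only to keep the arithmetic easy to follow.

```python
# Toy illustration: one weight, one training item, squared error.
# The model, the error measure, and the learning rate are all
# assumptions made for this example, not the article's network.

def predict(w, x):
    return w * x

def squared_error(w, x, y):
    return (predict(w, x) - y) ** 2

def gradient(w, x, y):
    # Derivative of (w*x - y)^2 with respect to w.
    return 2.0 * (predict(w, x) - y) * x

w = 0.5              # current weight value
x, y = 2.0, 3.0      # known input and known correct output
learn_rate = 0.1     # hypothetical step size

error_before = squared_error(w, x, y)
w_new = w - learn_rate * gradient(w, x, y)   # step against the gradient
error_after = squared_error(w_new, x, y)

print(error_before, error_after)  # the error shrinks after the update
```

A real NN does the same thing for every weight at once, with back-propagation supplying the gradient values.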

There are three variations of back-propagation. The first variation is called batch training. In this variation, all the training items are used to calculate the weight gradients, and then each weight value is adjusted.

The second variation is called online, or stochastic, training. In this variation, the gradients are calculated for each individual training item (giving an estimate of the gradients for the entire data set), and then each weight value is adjusted using the estimated gradients.

The third variation is called mini-batch training. In this variation, a batch of training items is used to compute the estimated gradients, and then each weight value is adjusted using the estimated gradients.
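The three variations differ only in how many training items feed each gradient estimate, so one loop with a batch size parameter can cover all of them. The sketch below is an illustration under stated assumptions: it uses a tiny linear model (y = w0 + w1 * x) instead of a real NN, and the function name train, the learning rate, and the synthetic data are all hypothetical.

```python
import random

def train(items, weights, batch_size, learn_rate, max_epochs, seed=0):
    """Gradient descent on a toy linear model y = w0 + w1 * x.

    batch_size == len(items) gives batch training, batch_size == 1
    gives online (stochastic) training, anything between is mini-batch.
    """
    rnd = random.Random(seed)
    weights = list(weights)
    num_updates = 0
    for _ in range(max_epochs):
        order = list(range(len(items)))
        rnd.shuffle(order)  # visit the items in a new random order each pass
        for start in range(0, len(order), batch_size):
            batch = [items[i] for i in order[start:start + batch_size]]
            grads = [0.0, 0.0]
            for x, y in batch:
                err = (weights[0] + weights[1] * x) - y
                grads[0] += err        # gradient term for w0
                grads[1] += err * x    # gradient term for w1
            # Adjust every weight using the averaged gradient estimate.
            for j in range(len(weights)):
                weights[j] -= learn_rate * grads[j] / len(batch)
            num_updates += 1
    return weights, num_updates

# Synthetic data generated from y = 2x + 1 (an assumption for the demo).
items = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(10)]

results = {}
for name, size in [("batch", 10), ("online", 1), ("mini-batch", 5)]:
    results[name] = train(items, [0.0, 0.0], size, 0.5, 500)
    w, n = results[name]
    print(name, "updates:", n, "weights:", [round(v, 3) for v in w])
```

On this easy, noise-free problem all three settings should end up close to w0 = 1 and w1 = 2. What changes is how many weight updates each one performs along the way.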

For example, suppose an NN has 86 weights. If there are 500 training items, then batch training computes the 86 gradients using all 500 training items and then updates each weight, so there’d be one set of updates for one pass through the data set.

In online (stochastic) training, the 86 gradients are estimated using one of the 500 data items at a time, and after each estimate, all 86 weights are updated, and so there’d be 500 sets of updates for one pass through the training data.

For mini-batch training, if the batch size is set to 100, then 100 training items are used to estimate the 86 gradients, and then the 86 weights would be updated. This would happen 500 / 100 = 5 times, so there’d be 5 sets of updates for one pass through the training data.
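The update counts in the 500-item example can be checked with a few lines of code. The helper name updates_per_epoch is hypothetical, and rounding up for a final, smaller batch is an assumption for cases where the batch size does not divide the number of items evenly.

```python
import math

def updates_per_epoch(num_items, batch_size):
    # Round up so a final, smaller batch still counts as one update
    # (an assumption; it does not matter when the sizes divide evenly).
    return math.ceil(num_items / batch_size)

num_items = 500
counts = {
    "batch": updates_per_epoch(num_items, num_items),
    "online": updates_per_epoch(num_items, 1),
    "mini-batch (100)": updates_per_epoch(num_items, 100),
}
print(counts)  # {'batch': 1, 'online': 500, 'mini-batch (100)': 5}
```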

There has been much research and discussion about which form of back-propagation works best. In my opinion, based on the research I’ve seen, there’s no single best approach. Whether batch, online, or mini-batch training works best depends on the problem under investigation.