I wrote an article titled “Deep Neural Network Training” in the September 2017 issue of Microsoft MSDN Magazine. See https://msdn.microsoft.com/en-us/magazine/mt842505.
A regular neural network has a set of input nodes a set of hidden processing nodes, and a set of output nodes. Each node-to-node connection has a numeric constant, called a weight, that must be determine during training. (Each hidden node and each output node has a numeric constant, called a bias, that can be thought of as additional weights).
Training is the process of determining the values of the neural network weights. You do this by using a set of so-called training data that has known input values, and known, correct output vales. You use one of many possible algorithms to find values for the weights so that the computed output values closely match the known, correct output values. The most common training algorithm is called back-propagation.
Deep neural networks (DNNs) are just like regular NNs except DNNs have two or more layers of hidden nodes. In theory, a regular NN can approximate any continuous prediction function (the universal approximation theorem, or the Cybenko theorem). But in practice, DNNs can solve more difficult problems than regular NNs.
In my article, I describe the vanishing gradient problem. The back-propagation algorithm uses the Calculus derivative. Each node-to-node weight has a gradient. As more hidden layers are added to a DNN, the gradients of the weights closer to the input nodes (and therefore farther from the output nodes) tend to become smaller and smaller. There’s some tricky math, but essentially there are a lot of multiplications using the gradients (which are often values less between 0 and 1, so the products can go very close to zero, and training progress halts.
My article uses a dummy NN with just three hidden layers. Real DNNs used in speech an image recognition can have over 100 hidden layers and then the vanishing gradient becomes a real problem. I discuss a few of the approaches used to counteract the vanishing gradient problem.