The neural network back-propagation algorithm has many subtle details. If you define the squared error using an output-minus-target term, the weight update rule adds the delta-weight. If you define it using a target-minus-output term, the weight update rule subtracts the delta-weight.
This is very, very difficult to explain, but I’ll try. In the image (click to enlarge), hj is a hidden node and ok is an output node. Weight wjk connects them, and tk is the target value. In the image, the value of ok before softmax activation is +1.20 and the final ok value, after softmax, is 0.80 (which you’ll just have to take on faith because the other output nodes aren’t shown).
Because ok = 0.80 and tk = 1, we want the value of ok to increase so that it gets closer to the target value of 1.
As it turns out, there are 8 cases to analyze, depending on the sign of hj, the sign of wjk, and whether the computed output needs to increase or decrease in order to get closer to the target. I’ll assume positive hj, positive wjk, and that you want to increase ok, as shown.
In the image, I give the equation for the gradient, assuming that error is defined using output minus target. In this scenario everything is positive, so the gradient will be positive (+0.0192 in the example). Notice that the gradient has a target-minus-output term, the opposite of the error function’s output-minus-target term (because of how the calculus derivative works out).
Now, at this point, to compute the weight-delta you multiply a small positive learning rate constant eta (which looks kind of like a script lower-case ‘n’) by the gradient, so the weight-delta will be positive. If you add the positive weight-delta to the current positive weight wjk, the value of wjk increases.
With the new, increased weight value, when you compute the new output value by multiplying hj * wjk, the output value will increase, as desired.
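The arithmetic in this scenario can be sketched in a few lines of Python. The text doesn’t state hj, wjk, or eta, so these are assumptions: hj = 0.60 is chosen because it reproduces the +0.0192 gradient from the image, and wjk = 0.50 and eta = 0.10 are placeholder values for illustration.

```python
hj  = 0.60   # assumed hidden node value (positive); reproduces the +0.0192 gradient
wjk = 0.50   # assumed current weight (positive); exact value not stated in the text
ok  = 0.80   # output node value after softmax
tk  = 1.00   # target value
eta = 0.10   # assumed small positive learning rate

# gradient with the target-minus-output term, as described in the text
grad = (tk - ok) * ok * (1.0 - ok) * hj   # 0.20 * 0.16 * 0.60 = +0.0192
delta = eta * grad                        # positive weight-delta
wjk_new = wjk + delta                     # ADD the weight-delta: weight increases

print(grad)      # +0.0192
print(wjk_new)   # 0.50192, larger than the old 0.50
```

Because the new weight is larger and hj is positive, the product hj * wjk goes up, which is exactly the "output increases toward the target" behavior described above.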
Whew! Very tricky. If you analyze the other seven cases, you’ll see that if you assume error is defined using output minus target, then adding the weight-delta always moves the weight in the correct direction, positive or negative, so that the computed output will get closer to the target.
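Rather than working the other seven cases by hand, you can check all eight mechanically. This is a sketch with assumed illustrative values (0.6, 0.5, eta = 0.1); the key fact it exercises is that the pre-softmax output change works out to eta * (tk - ok) * ok * (1 - ok) * hj**2, and since hj**2 is never negative, the output always moves in the direction of the target.

```python
import itertools

# All 8 cases: sign of hj, sign of wjk, and whether the output must
# increase (tk > ok) or decrease (tk < ok). Magnitudes are assumed.
eta = 0.1
ok_up, ok_down = (0.80, 1.0), (0.80, 0.0)   # (output, target) pairs
for hj, wjk, (ok, tk) in itertools.product([0.6, -0.6], [0.5, -0.5],
                                           [ok_up, ok_down]):
    grad = (tk - ok) * ok * (1.0 - ok) * hj   # gradient with (t - o) term
    wjk_new = wjk + eta * grad                # always ADD the weight-delta
    # change in the pre-softmax output hj * wjk:
    net_change = hj * wjk_new - hj * wjk      # = eta*(tk-ok)*ok*(1-ok)*hj**2
    # hj**2 > 0, so the net moves in the direction of (tk - ok),
    # pushing ok toward tk in every case
    assert (net_change > 0) == (tk > ok)

print("all 8 cases move the output toward the target")
```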
But suppose that, at the very beginning, you assume the error function is defined using squared target minus output instead of output minus target, as I’ve just explained. Then the gradient will have an output-minus-target term, reversing the sign compared to the analysis above, and in the end the update rule would be to subtract the weight-delta instead of adding it.
Moral: There are many references that explain back-propagation in theory, but there are many, many details that you must deal with in practice.