A basic neural network (NN) has no memory of previous inputs or outputs. This means a NN has great trouble predicting the next token in a sequence. For example, suppose you want to predict the next word in the sentence, “I like pasta so tonight I’ll eat (blank).” A reasonable prediction would be “spaghetti,” but a basic neural network sees only one word at a time and probably wouldn’t do well on this prediction problem.
In the 1990s a special type of NN called a recurrent neural network (RNN) was devised. Each input is combined with the output of the previous iteration. For example, when presented with the word “so,” an RNN will remember the output from the previous input, “pasta.” In this way each input carries a trace of memory of previous outputs.
Implementing a simple RNN isn’t conceptually difficult, but it’s quite a chore in practice. One engineering strategy is to create a composable module, copies of which can be chained together. Expressed as a diagram:
Here Xt is the current input (typically a word or a letter) and Yt is the output (a vector of probabilities that represent the likelihood of each possible next word). The box labeled tanh is a layer of hidden processing nodes plus a layer of output nodes, essentially a mini neural network. But notice that just before the Xt input reaches the tanh internal NN, the output from the previous item is concatenated to the input.
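One way to see what the module computes is a minimal sketch in NumPy. The weight-matrix names (Wxh, Whh, Why) are my own labels, not anything from the diagram, and I use the common trick that concatenating [Xt, previous-state] and multiplying by one big matrix is equivalent to multiplying by two separate matrices and summing:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax: shift by max before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_cell(x_t, h_prev, Wxh, Whh, Why, bh, by):
    # The "tanh" box: combine current input with the carried-over state.
    # Wxh @ x_t + Whh @ h_prev is equivalent to concatenating
    # [x_t, h_prev] and applying one wider weight matrix.
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)
    # Output layer: probabilities over each possible next token
    y_t = softmax(Why @ h_t + by)
    return y_t, h_t
```

The cell returns both Yt (the prediction) and the new internal state, which is exactly what gets passed along to the next copy of the module.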
Once such a module has been implemented, you can chain them together like this:
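In code, “chaining” the modules amounts to a loop over the sequence that reuses one shared set of weights at every time step, feeding each step’s state into the next. Here is a self-contained sketch with made-up toy dimensions (a 5-word vocabulary, 8 hidden nodes); the variable names are assumptions, not anything standard:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 5, 8, 5   # toy vocabulary of 5 one-hot words

# One shared set of weights, reused at every time step
Wxh = rng.normal(0, 0.1, (n_hid, n_in))
Whh = rng.normal(0, 0.1, (n_hid, n_hid))
Why = rng.normal(0, 0.1, (n_out, n_hid))
bh, by = np.zeros(n_hid), np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(n_hid)                           # initial state: no memory yet
xs = [np.eye(n_in)[i] for i in [0, 3, 1, 4]]  # a 4-word input sequence
ys = []
for x in xs:                                  # chaining = a loop over time
    h = np.tanh(Wxh @ x + Whh @ h + bh)       # state carries memory forward
    ys.append(softmax(Why @ h + by))          # probabilities for next word
```

Each pass through the loop is one copy of the module from the diagram: same weights, different input, and a state vector that threads the whole chain together.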
This is pretty cool. However, in practice, these very simple RNNs just don’t perform well. The main problem is that they can’t remember enough: during training, the error signal fades as it propagates back through many time steps (the vanishing gradient problem), so information from early in a sequence gets lost. This gave rise to more sophisticated forms of RNNs, in particular the oddly named but very effective long short-term memory (LSTM) network.
So why even bother with simple RNNs? Because understanding basic RNNs is a good starting point for implementing LSTMs and even more exotic networks. So that’s what I’m doing.