Preparing Text for an LSTM Recurrent Neural Network

One of the really interesting deep learning techniques is text analysis with an LSTM (“long short-term memory”) recurrent neural network. LSTMs can work with sequences of text because, unlike ordinary feed-forward neural networks, they maintain an internal state that acts as a memory of what came earlier in the sequence.

Preparing data for an LSTM is a major challenge. I decided to work with the text from the Sherlock Holmes novel “A Study in Scarlet”. The first few words of the novel, after the title and the table of contents, are, “In the year 1878 I took my degree of Doctor of Medicine of the University of London, and proceeded . . .”

Conceptually, the idea is to create a training dataset that is like:

In the year 1878 | I 
the year 1878 I  | took
year 1878 I took | my
. . .

Each sequence of four words is used to predict the next word. This is “rolling window” data where the size of the window, four in this case, must be determined by trial and error.
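
Here is a minimal Python sketch of that windowing step. The whitespace tokenization and the window size of 4 are illustrative assumptions; real preprocessing would also have to deal with punctuation and casing.

# Minimal sketch: build rolling-window (input, target) pairs from raw text.
def make_windows(text, window_size=4):
    words = text.split()  # naive whitespace tokenization
    pairs = []
    for i in range(len(words) - window_size):
        seq = words[i : i + window_size]   # window_size consecutive words
        target = words[i + window_size]    # the word to predict
        pairs.append((seq, target))
    return pairs

text = "In the year 1878 I took my degree of Doctor of Medicine"
for seq, target in make_windows(text)[:3]:
    print(" ".join(seq), "|", target)
# In the year 1878 | I
# the year 1878 I | took
# year 1878 I took | my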

However, all neural networks, including LSTMs, only understand numeric values. As it turns out, if you use a naive approach and just assign an integer to each word, it just doesn’t work, because the integers imply a magnitude and ordering that the words don’t actually have. For example, if ‘In’ = 1, ‘the’ = 2, ‘year’ = 3, ‘1878’ = 4, ‘I’ = 5, and so on, the conceptual training data would look like:

1 2 3 4 | 5
2 3 4 5 | 6
3 4 5 6 | 7
. . .
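
For reference, a sketch of the naive mapping, using the same 1-based, order-of-first-appearance numbering as the example above:

# Sketch: assign each distinct word the next integer, in order of
# first appearance, starting at 1 (so 'In' = 1, 'the' = 2, and so on).
def build_vocab(words):
    word_to_idx = {}
    for w in words:
        if w not in word_to_idx:
            word_to_idx[w] = len(word_to_idx) + 1
    return word_to_idx

words = "In the year 1878 I took my degree".split()
vocab = build_vocab(words)
print([vocab[w] for w in words[:5]])  # [1, 2, 3, 4, 5]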

The standard technique of one-hot encoding doesn’t work very well either. If you had a total of 10 different words, then ‘In’ = (1 0 0 0 0 0 0 0 0 0), ‘the’ = (0 1 0 0 0 0 0 0 0 0), ‘year’ = (0 0 1 0 0 0 0 0 0 0), and so on. The problem is that even a small training corpus has thousands of distinct words, so each one-hot vector would be huge: thousands of components, almost all of them zero.
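
A quick sketch of one-hot encoding for that 10-word case; with a realistic vocabulary, vocab_size would be in the thousands:

import numpy as np

# One-hot: a vector of all zeros except a single 1 at the word's index.
def one_hot(index, vocab_size):
    vec = np.zeros(vocab_size, dtype=np.float32)
    vec[index] = 1.0
    return vec

print(one_hot(0, 10))  # 'In'  -> [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print(one_hot(1, 10))  # 'the' -> [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]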

The usual approach is to create a “word embedding”, where each word is represented by a dense vector of numeric values, typically 100 to 300 of them. The vectors are constructed in an extremely clever way so that words with similar meanings have similar vectors.
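
I’m not committing to a particular library here, but as one common way an embedding enters the picture, here is a hedged Keras sketch where an Embedding layer maps integer word IDs to trainable dense vectors that feed an LSTM. The vocabulary size, embedding dimension, and layer sizes are all assumptions:

import tensorflow as tf

vocab_size = 5000  # assumption: number of distinct words in the corpus
embed_dim = 100    # assumption: length of each word vector

# Integer word IDs go in, trainable 100-value word vectors come out,
# and the LSTM consumes the sequence of vectors.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

With sparse_categorical_crossentropy, the word-to-predict can stay an integer ID rather than a one-hot vector.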

I wrote a utility program that created a training file for an LSTM using “A Study in Scarlet”. The process was tricky and took my entire lunch hour, and I’m quite good at code like this.

It would take several pages to explain the multi-step process, but the screenshot below summarizes what is going on. A very interesting challenge. The next step is to see if I can build an LSTM model that can understand the structure of a Sherlock Holmes novel.

Note: Depending on the type of analysis I’m going to do, I may have to one-hot encode the word-to-predict.
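
The full utility is too long to show here, but a condensed sketch of the core idea looks something like this. The file names, whitespace tokenization, and pipe-delimited integer output format are all simplifications; the word-to-predict is written as an integer ID, which could be one-hot encoded later if needed.

# Condensed sketch: read raw text, build a vocabulary, and write
# rolling-window training items as pipe-delimited integer IDs.
def make_training_file(src_path, dest_path, window=4):
    with open(src_path, "r", encoding="utf-8") as f:
        words = f.read().split()
    vocab = {}
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab)
    with open(dest_path, "w", encoding="utf-8") as f:
        for i in range(len(words) - window):
            ids = [str(vocab[w]) for w in words[i : i + window]]
            target = str(vocab[words[i + window]])
            f.write(" ".join(ids) + " | " + target + "\n")
    return vocab

# Hypothetical file names:
# vocab = make_training_file("study_in_scarlet.txt", "scarlet_train.txt")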

[Screenshot: the multi-step process for turning “A Study in Scarlet” into LSTM training data]
“A Study in Scarlet” was published in Beeton’s Christmas Annual (a magazine somewhat like a paperback book today) in November 1887, along with two short plays by other authors. It was the first appearance of Sherlock Holmes.
