I successfully implemented an LSTM network using CNTK with Word2Vec embeddings. Let me explain. I started with a paragraph of the Sherlock Holmes novel “A Study in Scarlet”. The first couple of sentences (converted to lower case, punctuation removed) are:
in the year 1878 i took my degree of doctor of medicine of the university of london and proceeded to netley to go through the course prescribed for surgeons in the army having completed my studies there i was duly attached to the fifth northumberland fusiliers as assistant surgeon
My goal was to create a prediction model: given N words, what is the next word? For example, if N = 4 and the input sequence is “year 1878 i took” then the model should predict “my”. First I converted all the words to index values: “in” = 0, “the” = 1, “year” = 2, and so on. In theory I could have used these index values directly, but a much better approach is to convert each word/index to a numeric vector of floating point values.
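A minimal sketch of the indexing step, before any vectors come into play, might look like this (the variable names here are illustrative):

# Build a word-to-index vocabulary in order of first appearance.
text = "in the year 1878 i took my degree of doctor of medicine " \
       "of the university of london"
words = text.split()
vocab = {}
for w in words:
    if w not in vocab:
        vocab[w] = len(vocab)
# vocab["in"] = 0, vocab["the"] = 1, vocab["year"] = 2, and so on.
indexed = [vocab[w] for w in words]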
The word-to-vector conversion is called an embedding. I used the Word2Vec tool to create the embeddings for each of the 86 distinct words in the source text. I set the vector length to 32. The result for “the” was:
Vector for 'the' is: [ 3.0290568e-03 1.1347506e-02 2.5496054e-03 -1.3096497e-02 -5.7233768e-03 9.1301277e-03 -2.6647178e-03 1.2957667e-02 -3.7651435e-03 -1.0592117e-02 -6.0152885e-05 8.1940945e-03 -1.1889883e-02 -1.5280096e-02 4.6902723e-03 -1.0119098e-02 -1.0269336e-02 -9.8525938e-03 -8.9324228e-03 1.3820899e-02 8.8472795e-03 -1.0620472e-02 1.3961374e-03 1.3016418e-02 -9.3864333e-03 -1.1885420e-02 7.3955222e-03 1.3285194e-02 1.1789358e-02 8.3396314e-03 -8.4532667e-03 -4.6083345e-03]
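To make the embedding step concrete, here is a minimal sketch using the gensim library (gensim is just one of several Word2Vec implementations; the file name, window, and seed values below are placeholder choices):

# Train 32-value Word2Vec embeddings on the tokenized text.
# Note: gensim 4.x uses vector_size=; older versions call it size=.
from gensim.models import Word2Vec

words = open("scarlet.txt").read().split()   # placeholder file name
model = Word2Vec([words], vector_size=32, window=5, min_count=1, seed=1)
print("Vector for 'the' is:", model.wv["the"])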
Next I created a data file for a CNTK network. The data file looked like:
0 |curr -0.86816233 0.28763667 . . 1.50366807 |next 4:1
0 |curr 0.30290568 1.13475062 . . -0.46083345
0 |curr -0.65285438 -0.69098999 . . 1.46716731
. . .
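For reference, a file in this sequence format can be read back with CNTK’s CTF reader. Here’s a sketch (the file name is a placeholder; 32 is the embedding size and 86 is the vocabulary size):

# 'curr' holds dense 32-value word vectors; 'next' holds the sparse
# one-hot index of the next word (86 distinct words in the vocabulary).
from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, \
    StreamDefs, INFINITELY_REPEAT

reader = MinibatchSource(
    CTFDeserializer("sherlock_seq.txt", StreamDefs(
        curr=StreamDef(field='curr', shape=32, is_sparse=False),
        next=StreamDef(field='next', shape=86, is_sparse=True))),
    randomize=True, max_sweeps=INFINITELY_REPEAT)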
By the way, figuring out each of these steps was rather difficult, and each one took several days of work. I was stuck on the CNTK sequence format until I got some valuable information from a colleague, William Darling. Without that key information, I’d still be stuck.
I’m leaving out tons of details. For example, CNTK has a built-in Embedding layer you can use instead of precomputed Word2Vec embeddings, and that built-in layer can also accept a text file of the Word2Vec vector values. And there are many other details.
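Here’s a sketch of both uses of the built-in layer (the weight-file name is a placeholder):

# Option 1: let CNTK learn a 32-value embedding during training.
# Option 2: load precomputed Word2Vec vectors (86 x 32) and pass them
# as fixed weights via the weights= parameter.
import numpy as np
import cntk as C

learned_embed = C.layers.Embedding(shape=32)
w2v_weights = np.loadtxt("w2v_vectors.txt")   # placeholder file name
fixed_embed = C.layers.Embedding(weights=w2v_weights)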
With my data ready at last, I ran a program to train the model. It failed spectacularly until I noticed that the Word2Vec vector values were very small (like 0.0001234), so I scaled them up by multiplying by 100 (see the data snippet above).
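Putting the pieces together, a minimal sketch of the scale-up plus an LSTM network in CNTK’s layers API looks like this (the hidden size of 128 is an illustrative choice, and the file name is a placeholder as before):

# Scale the tiny Word2Vec values up by 100, then define an LSTM that
# reads a sequence of 32-value word vectors and predicts one of 86 words.
import numpy as np
import cntk as C

raw = np.loadtxt("w2v_vectors.txt")   # placeholder file name
scaled = raw * 100.0                  # the x100 scale-up described above

input_var = C.sequence.input_variable(32)   # sequence of embedded words
label_var = C.input_variable(86)            # one-hot next word

model = C.layers.Sequential([
    C.layers.Recurrence(C.layers.LSTM(128)),  # run the LSTM over the words
    C.sequence.last,                          # keep only the final state
    C.layers.Dense(86)])(input_var)           # scores for each word

loss = C.cross_entropy_with_softmax(model, label_var)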
Finally, after weeks of work, I was able to create an LSTM network model of the first paragraph of “A Study in Scarlet” using Word2Vec embeddings.
Check out William’s excellent video about machine learning for sequences at https://www.youtube.com/watch?v=Vi05nEzAS8Y