Machine Learning with Natural Language

Natural language processing (NLP) is an important area of machine learning (ML). The Hello World problem for NLP is to take a body of text, such as a paragraph or an entire book, and create a model that, when given a word from the text, predicts the next word.

Ordinary ML techniques can’t handle such a problem because the next word in a piece of text doesn’t depend on just the previous word; it depends on many previous words. For example, if I asked you what word follows “brown”, you’d have to take a wild guess, but if I told you the two previous words were “the” and “quick”, you’d probably guess that the next word is “fox”.
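
To see why a single word of context isn’t enough, here’s a minimal sketch of a predictor that uses only the previous word (just for illustration; this isn’t an approach you’d use in practice):

from collections import Counter, defaultdict

def build_model(tokens):
    # map each word to a Counter of the words observed to follow it
    model = defaultdict(Counter)
    for (curr, nxt) in zip(tokens, tokens[1:]):
        model[curr][nxt] += 1
    return model

def predict_next(model, word):
    # return the most frequent follower, or None for an unseen word
    followers = model.get(word)
    return followers.most_common(1)[0][0] if followers else None

tokens = "the quick brown fox jumps over the lazy dog".split()
model = build_model(tokens)
print(predict_next(model, "brown"))  # fox
print(predict_next(model, "the"))    # ambiguous: could be quick or lazy

With only one word of context, “the” could be followed by “quick” or “lazy”; more context would resolve the ambiguity.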

NLP is quite difficult. The first step is to encode the source text, because ML systems only understand numbers. One common way (but by no means the only way, or the best way) is to use “one-hot” encoding (also called “1-of-N” encoding). Suppose your source text is just 10 words: “There must be some kind of way out of here.” Then there are 9 distinct words (“of” is repeated). The nine words could be encoded as “There” = (1,0,0,0,0,0,0,0,0), “must” = (0,1,0,0,0,0,0,0,0), “be” = (0,0,1,0,0,0,0,0,0), . . . “here” = (0,0,0,0,0,0,0,0,1).
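
In code, generating a one-hot encoding might look something like this (a minimal sketch; the function name is mine):

def one_hot_encode(tokens):
    # assign each distinct word an index in order of first appearance
    word_to_idx = {}
    for w in tokens:
        if w not in word_to_idx:
            word_to_idx[w] = len(word_to_idx)
    n = len(word_to_idx)
    # each word becomes a vector of n values with a single 1
    return { w: tuple(1 if i == j else 0 for j in range(n))
             for (w, i) in word_to_idx.items() }

enc = one_hot_encode("There must be some kind of way out of here".split())
print(enc["There"])  # (1, 0, 0, 0, 0, 0, 0, 0, 0)
print(enc["here"])   # (0, 0, 0, 0, 0, 0, 0, 0, 1)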

So, when doing NLP, you have to spend a lot of time massaging the source text data. I took a few lines from the James Bond novel “Dr. No” and wrote a utility program that created a text file suitable for use by the CNTK ML code library. The source text is:

Bond watched the big green turtle-backed island grow on the horizon
and the water below him turn from the dark blue of the Cuba Deep to
the azure and milk of the inshore shoals . Then they were over the
North Shore , over its rash of millionaire hotels , and crossing
the high mountains of the interior . The scattered dice of
small-holdings showed on the slopes and in clearings in the jungle ,
and the setting sun flashed gold on the bright worms of tumbling
rivers and streams .

The output for the first three pairs of words is:

|prev 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 |next 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

|prev 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 |next 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

|prev 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 |next 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0

. . .

The utility program scanned the source text and counted how many unique words there are; that count determines the length of each one-hot vector. Then the source text was scanned again, and each unique word was inserted into a Dictionary object. For example, word_dict[“Bond”] = 39 and word_dict[“the”] = 2. The utility also created a reverse dictionary, for example, indx_dict[39] = “Bond”.
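
In outline, the utility does something like this (a simplified sketch, not the exact code; the file names are placeholders, and the actual index assignment may differ):

# build_ctf.py -- sketch of a utility that writes CNTK text-format data
source = open("dr_no_snippet.txt", "r").read()
tokens = source.split()  # assumes punctuation is already separated by spaces

# first scan: assign each distinct word an index
word_dict = {}
for w in tokens:
    if w not in word_dict:
        word_dict[w] = len(word_dict)
vocab_size = len(word_dict)  # length of each one-hot vector

# reverse dictionary, to decode model predictions later
indx_dict = {i: w for (w, i) in word_dict.items()}

def one_hot_str(idx):
    # vocab_size space-separated values with a single 1 at position idx
    vals = ["0"] * vocab_size
    vals[idx] = "1"
    return " ".join(vals)

# second scan: one (previous word, next word) pair per output line
with open("bond_ctf.txt", "w") as f:
    for (prev, nxt) in zip(tokens, tokens[1:]):
        f.write("|prev " + one_hot_str(word_dict[prev]) +
                " |next " + one_hot_str(word_dict[nxt]) + "\n")

CNTK can read a file in this format using a CTFDeserializer wrapped in a MinibatchSource, with a dense stream of width vocab_size declared for each of the |prev and |next fields.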

NLP can get very, very complicated, but if you’re new to NLP, you just have to take it one step at a time.


“Hot 9” – Jackson Pollock
