Encoding Words for Machine Learning Analysis using Word2Vec

Neural networks understand only numbers. Therefore, if you are working with text, words must be converted into numbers. Suppose you have a corpus, that is, a document or collection of documents of interest. You could assign an integer to each word. For example, if the text started with “In the beginning” then you could set “In” = 1, “the” = 2, “beginning” = 3, and so on.
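
In code, that naive integer assignment might look something like this minimal sketch:

word_to_id = {}
for word in "In the beginning God created the heaven and the earth".split():
  if word not in word_to_id:
    word_to_id[word] = len(word_to_id) + 1  # next unused integer
# word_to_id is {'In': 1, 'the': 2, 'beginning': 3, 'God': 4, ... }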

But assigning values like this just doesn’t work very well because of how neural networks operate. Briefly, because 1 and 2 are close together numerically, “In” and “the” would be treated as very similar, even though the two words have nothing in common semantically.

The Word2Vec (“word to vector”) system is one of the best ways to encode words. Briefly, each word is assigned a vector of numbers in a very clever way so that similar words have similar numeric values in the vector. There are several implementations of Word2Vec but I prefer the one in the gensim Python library (the name originally stood for “generate similar” text).

I wrote a short demo. First I installed the gensim Python package using “pip install gensim”. Then I wrote a Python script. My dummy corpus consisted of just three sentences. In a real scenario, your corpus could be huge, such as all of Wikipedia, or hundreds of thousands of news stories. I hard-coded my corpus like so:

sentences = [['In', 'the', 'beginning', 'God', 'created',
              'the', 'heaven', 'and', 'the', 'earth.', 
              'And', 'the', 'earth', 'was', 'without',
              'form,', 'and', 'void;', 'and', 'darkness',
              'was', 'upon', 'the', 'face', 'of', 'the',
              'deep.', 'And', 'the', 'Spirit', 'of', 'God',
              'moved', 'upon', 'the', 'face', 'of', 'the',
              'waters.']]

In a real problem, setting up the corpus is the hard part. You have to deal with punctuation, capitalization, and so on. In this demo I hard-coded the corpus as a list-of-lists. In a non-demo scenario, I’d likely read a corpus from a (UTF-8) text file like: sentences = word2vec.Text8Corpus('C:\\Data\\Corpuses\\whatever.txt').
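
If you do need to clean up a raw text file yourself, gensim has a simple_preprocess() utility that lowercases, strips punctuation, and tokenizes. A minimal sketch (the file name corpus.txt is just a placeholder):

from gensim.utils import simple_preprocess

sentences = []
with open('corpus.txt', encoding='utf-8') as f:
  for line in f:
    tokens = simple_preprocess(line)  # lowercase, strip punctuation, tokenize
    if tokens:
      sentences.append(tokens)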

I built a model, specifying 10 values for each word vector (in a realistic large corpus, you’d use something like 100 or 200 values per word). Then I displayed the values for the word ‘earth’:

[ 0.01721778 -0.03160927 -0.01329765
-0.03671417 0.03356135 -0.03182576
-0.00196723 0.01548103 -0.02937444
0.04018674]

If you had a neural network, this vector is what you’d feed to the network instead of the word ‘earth’. The Word2Vec library has all kinds of additional capabilities. It’s a remarkable library.
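
For example, once the model below has been trained, you can ask for the words most similar to a given word. A minimal sketch (on a tiny corpus like mine the results are essentially noise, but on a large corpus they can be surprisingly good):

similar = model.wv.most_similar('earth', topn=3)  # 3 nearest words by cosine similarity
for word, score in similar:
  print(word, score)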

# word_to_vec_demo.py

from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
  level=logging.INFO)

sentences = [['In', 'the', 'beginning', 'God', 'created', 'the',
 'heaven', 'and', 'the', 'earth.', 'And', 'the', 'earth', 'was',
 'without', 'form,', 'and', 'void;', 'and', 'darkness', 'was',
 'upon', 'the', 'face', 'of', 'the', 'deep.', 'And', 'the',
 'Spirit', 'of', 'God', 'moved', 'upon', 'the', 'face',  'of',
 'the', 'waters.']]

print("\nBegin training model on corpus")
# note: in gensim 4.0 and later, the 'size' parameter is named 'vector_size'
model = word2vec.Word2Vec(sentences, size=10, min_count=1)
print("Model created \n")

print("Vector for \'earth\' is: \n")
print(model.wv['earth'])

print("\nEnd demo")

