Word Similarity using GloVe

The GloVe (“global vectors for word representation”) data maps an English word, such as “love”, to a vector of values (for example 100 values). See https://nlp.stanford.edu/projects/glove/

There are different versions of GloVe. One of the simplest used Wikipedia plus the Gigaword news corpus as its source (six billion word tokens, counting repeats), extracted 400,000 distinct words, and then trained a model on word co-occurrence counts to generate a vector of 100 values for each word.

The vectors are generated in a very clever way so that two semantically similar words have mathematically similar vectors. So, if you want to find words that are semantically close to the word “chess”, you’d get the GloVe vector for “chess”, then scan through the other 399,999 GloVe vectors, finding the vectors that are close (using Euclidean distance). Then you’d map the close vectors back into words.
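Here's a minimal sketch of that scan, to make the idea concrete. It assumes the GloVe text format of one word per line followed by its space-separated values. The function names load_glove() and nearest_words() are my own, and the toy three-value vectors are made up for illustration, so this is not real GloVe data.

```python
import math

def load_glove(lines):
    """Parse lines of the form "word v1 v2 ... vn" into a dict of word -> vector."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def nearest_words(word, vectors, k=5):
    """Return the k words whose vectors are closest to word's vector (Euclidean)."""
    target = vectors[word]
    scored = [
        (math.dist(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort()  # smallest distance first
    return [(other, dist) for dist, other in scored[:k]]

# Made-up toy vectors (hypothetical values, not taken from GloVe).
toy_lines = [
    "chess 1.0 0.0 0.0",
    "checkers 0.9 0.1 0.0",
    "banana 0.0 1.0 0.5",
    "poker 0.7 0.2 0.1",
]
toy = load_glove(toy_lines)
result = nearest_words("chess", toy, k=2)
for w, d in result:
    print(f"{w:10s} {d:.4f}")
```

With a real download you'd pass an open file instead of the toy list, for example open("glove.6B.100d.txt", encoding="utf-8"), assuming that is the file you unzipped. A plain Python scan over 400,000 vectors is slow, so in practice you'd probably want to vectorize the distance calculation.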

GloVe is useful when the data you are working with is general in nature. But if you have highly specialized text, such as legal or medical text, then you’re usually better off creating your own custom word embedding vectors using the gensim library.

Neat. Neural methods have really revolutionized natural language processing.

Image query: “painting of a woman with gloves” (left) and “a woman painting with gloves” (right). Natural language processing is tricky.
