In computer science, and life, it helps to be smart but it’s also important to have determination. I’m not the smartest guy in the Universe, but once a problem gets stuck in my head it will stay there until it gets solved.
I’ve been looking at sentiment analysis using a PyTorch neural network with an EmbeddingBag layer. I started by looking at an example in the PyTorch documentation, but that example uses the AG News dataset, which contains 1,000,000 short news snippets — far too much data to work with when you’re trying to dissect an example. Additionally, the demo uses the built-in torchtext.datasets.AG_NEWS() class, which magically serves up data in a special format; in real life you must deal with data wrangling yourself.
So, over the past couple of months I’ve been slowly but surely dissecting the documentation example so that I could create my own system. I hit a milestone recently when I got a complete end-to-end example working. I created 20 tiny movie reviews, each of which is labeled as 0 (negative sentiment) or 1 (positive). The goal is to train a neural model to correctly classify a tiny review as positive or negative.
In most natural language processing (NLP) problem scenarios, each word in a sequence/sentence is converted to an integer index using a Vocabulary object, and then the index representing the word is converted to a numeric vector of about 100 values, called a word embedding. Each word embedding is sequentially fed to an extremely complex neural system — typically an LSTM for moderate-length input or a Transformer for long input.
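The word-to-index-to-vector pipeline can be sketched in a few lines of PyTorch. The tiny vocabulary and review text below are made-up placeholders, and the embedding weights are random and untrained — this only shows the shape of the data as it moves through the pipeline:

```python
import torch
import torch.nn as nn

# A made-up toy vocabulary: word -> integer index.
# In a real system this is built from the training corpus.
vocab = {"<unk>": 0, "the": 1, "movie": 2, "was": 3, "great": 4}

review = "the movie was great"
indices = [vocab.get(w, vocab["<unk>"]) for w in review.split()]
print(indices)  # [1, 2, 3, 4]

# Each index maps to a learned vector of (here) 100 values.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=100)
vectors = embedding(torch.tensor(indices))
print(vectors.shape)  # torch.Size([4, 100]) -- one 100-value vector per word
```

The key point is that the output has one vector per word, which is why a downstream LSTM or Transformer must then process the vectors sequentially.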
An EmbeddingBag layer converts an entire sequence/sentence to a numeric vector. This is dramatically simpler than a word embedding approach — but still extremely tricky (just like all NLP problems).
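A minimal sketch of the difference: nn.EmbeddingBag pools (by default, averages) the per-word vectors, so an entire review collapses into a single vector. The index and offset values below are illustrative, not from a real vocabulary:

```python
import torch
import torch.nn as nn

# Two reviews packed into one flat tensor of word indices (made-up values).
# offsets marks where each review starts: review 0 at position 0, review 1 at 4.
text = torch.tensor([1, 2, 3, 4, 1, 2, 3, 0])
offsets = torch.tensor([0, 4])

# mode='mean' averages the word vectors, so each review -> one 100-value vector.
bag = nn.EmbeddingBag(num_embeddings=5, embedding_dim=100, mode='mean')
out = bag(text, offsets)
print(out.shape)  # torch.Size([2, 100]) -- one vector per review, not per word
```

Because the output is one fixed-size vector per review, it can be fed straight into an ordinary linear classification layer — no LSTM or Transformer required.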
I intend to tidy up my demo program and write up an explanation and then publish it in Microsoft Visual Studio Magazine. Even though the demo program is only about 200 lines long, it is very dense in terms of ideas so my explanation will likely take two or three articles.
People who aren’t programmers or developers or data scientists don’t understand our world (if you’re reading this blog post, you are probably part of “our world”). We don’t relentlessly work on difficult problems because of some external force — we do so because our brains are wired that way.
The history of computer science is largely one of men with relentless determination to create. Left: Wilhelm Schickard (1592–1635) designed, but did not build, a “calculating clock” that would have performed addition, subtraction, multiplication and division. Center: The Z1 mechanical computer was built by Konrad Zuse (1910–1995) in 1937. It weighed about 2,000 pounds and had 20,000 parts. The Z1 contained almost all the parts of a modern computer but wasn’t reliable. Right: In 1978, some MIT students built a Tinkertoy mechanical computer from 10,000 parts and fishing line. It was hard-wired to play tic-tac-toe.