I wrote an article titled “Sentiment Analysis Using a PyTorch EmbeddingBag Layer” in the July 2021 edition of the online Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/07/06/sentiment-analysis.aspx.
Natural language processing (NLP) problems are very difficult. A common type of NLP problem is sentiment analysis. The goal of sentiment analysis is to predict whether some text is positive (class 1) or negative (class 0). For example, a movie review such as “This was the worst film I’ve seen in years” would be classified as negative.
In situations where the text to analyze is long (say, several sentences with a total of 40 words or more), two popular approaches for sentiment analysis are to use an LSTM (long short-term memory) network or a Transformer architecture network. These two approaches are very difficult to implement. For situations where the text to analyze is short, the PyTorch code library has a relatively simple EmbeddingBag class that can be used to create an effective NLP prediction model.
In my article, I present a complete end-to-end demo. The source data is 20 short movie reviews. I explain nine steps:
1. How to create and use a tokenizer object
2. How to create and use a Vocab object
3. How to create an EmbeddingBag layer and use it in a neural network
4. How to design a custom collating function for use by a DataLoader
5. How to design a neural network that uses all these components
6. How to train the network
7. How to evaluate the prediction accuracy of the trained model
8. How to use the model to make a prediction for a movie review
9. How to integrate all the pieces into a complete working program
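To make steps 1, 2, and 4 a bit more concrete, here is a minimal plain-Python sketch of the data preparation an EmbeddingBag model needs. This is not the code from the article: the `tokenize`, `build_vocab`, and `collate` helpers and the three toy reviews are hypothetical stand-ins for a real tokenizer, Vocab object, and DataLoader collating function. The point is the output format: all token IDs in a batch are flattened into one list, and an offsets list records where each review starts.

```python
from collections import Counter

# Hypothetical toy data: (label, review) pairs; 1 = positive, 0 = negative.
reviews = [
    (1, "This was a great movie"),
    (0, "The worst film in years"),
    (1, "Great acting and a great story"),
]

def tokenize(text):
    """Lowercase and split on whitespace (stand-in for a real tokenizer)."""
    return text.lower().split()

def build_vocab(texts, specials=("<unk>",)):
    """Map each token to an integer ID, most frequent tokens first."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i for i, tok in enumerate(list(specials) + ordered)}

def collate(batch, vocab):
    """Flatten a batch of reviews into (labels, token_ids, offsets)."""
    labels, token_ids, offsets = [], [], []
    for label, text in batch:
        offsets.append(len(token_ids))  # index where this review starts
        token_ids.extend(vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text))
        labels.append(label)
    return labels, token_ids, offsets

vocab = build_vocab(text for _, text in reviews)
labels, token_ids, offsets = collate(reviews, vocab)
print(offsets)         # [0, 5, 10]
print(len(token_ids))  # 16
```

In a PyTorch program the three lists would be converted to tensors before being handed to the network, and a function like `collate` would be passed to a DataLoader through its `collate_fn` argument.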
An NLP system that uses an EmbeddingBag layer has two key distinguishing characteristics. First, it is much simpler than a system that uses a regular Embedding layer, although EmbeddingBag systems are still very complex. Second, it does not use information about the order of words in a sentence. This means EmbeddingBag layer systems work best with short input sentences (perhaps about 20 words or fewer).
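The order-insensitivity is easy to see in a small experiment. The sketch below is an assumption-laden illustration, not the network from the article: the class name `SentimentNet`, the layer sizes, and the token IDs are all made up. It feeds the same three token IDs in two different orders through an `nn.EmbeddingBag` layer followed by two linear layers, and checks that both orderings produce the same output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class SentimentNet(nn.Module):
    """Hypothetical tiny classifier: EmbeddingBag -> Linear -> Linear."""
    def __init__(self, vocab_size, embed_dim=8, hidden=16):
        super().__init__()
        # mode="mean" averages the embeddings of all tokens in each bag
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.fc1 = nn.Linear(embed_dim, hidden)
        self.fc2 = nn.Linear(hidden, 2)  # two classes: negative, positive

    def forward(self, token_ids, offsets):
        z = self.embed(token_ids, offsets)  # one vector per review
        z = torch.tanh(self.fc1(z))
        return self.fc2(z)                  # raw logits

net = SentimentNet(vocab_size=10)
net.eval()

# Two "reviews" holding the same token IDs in a different order,
# packed into one flat tensor; offsets mark where each review starts.
token_ids = torch.tensor([1, 2, 3, 3, 2, 1], dtype=torch.long)
offsets = torch.tensor([0, 3], dtype=torch.long)

with torch.no_grad():
    logits = net(token_ids, offsets)  # shape (2, 2)

same = torch.allclose(logits[0], logits[1], atol=1e-6)
print(logits.shape, same)
```

Because the layer averages the token embeddings in each bag, reordering the words cannot change the result. That is why a review like "not good, bad" and "not bad, good" would look identical to such a model.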
Deep neural systems have completely revolutionized natural language processing systems. Years ago, I worked on one of the very first Internet search engines, before Google existed. Dealing with free-form input that users typed into the search box was a big headache. We had to do all kinds of crazy things like handcrafted stemming and lemmatization (reducing words to a root or dictionary form) and word-sense disambiguation (dealing with different contexts, such as the meaning of “pound” to a doctor in the US, and to a banker in the UK).
Airport Bags. Left: We all know some women who pack like this for a two-day trip. Left-Center: Anyone who has children can understand this photo of a child embedded into a bag. Right-Center: Many people put something on their luggage to easily identify it. Right: Rolling luggage has wheels on it for a purpose — but maybe not this.