Creating a PyTorch Vocabulary Object for Natural Language Processing Problems

Even the simplest natural language processing problem is extremely difficult. With a basic neural network classifier, you usually have to normalize numeric data (such as dividing a person’s age by 100) and encode non-numeric data (e.g., “red” = (1, 0, 0)), but then you’re good to go.

Preparing NLP data is very time-consuming. One of many preparation tasks is creating a Vocabulary object. A Vocabulary object maps each word in the problem context to a 0-based integer index, based on how common the word is. For example, if you had some more or less normal source text, the word “the” might be mapped to 4 if it was the fifth most common word/token in your source text.

To map each word in your source text, you have to split sentences into words (called tokenizing) and then usually normalize by converting to lower case. Even this step is surprisingly tricky in practice: punctuation, and situations such as “Smith” (a person’s name) versus “smith” (a profession), are troublesome.
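To make the tokenizing step concrete, here is a minimal stdlib-only sketch. It uses a toy regular expression splitter of my own devising (the `toy_tokenize` function is hypothetical, not part of torchtext); the real “basic_english” tokenizer handles many more cases.

```python
# toy_tokenize.py
# A minimal sketch of tokenizing: lower-case the text, then split
# words and punctuation into separate tokens with a regular expression.
# This is a toy illustration, not torchtext's basic_english tokenizer.
import re

def toy_tokenize(text):
  text = text.lower()
  # match runs of letters/apostrophes, or single punctuation characters
  return re.findall(r"[a-z']+|[.,!?;-]", text)

print(toy_tokenize("This was a BAD movie."))
# ['this', 'was', 'a', 'bad', 'movie', '.']
```

Notice that the period ends up as its own token, which is why punctuation characters can appear as entries in a Vocabulary.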

A completely revamped version of the PyTorch Torchtext library for NLP was released a couple of months ago. I am slowly but surely wading through all the new ways of working with NLP problems with PyTorch.

Today I set out to code up a demo of making a Vocabulary object. First I created a dummy data file of six movie reviews labeled 0 = bad, 1 = good:

0, This was a BAD movie.
1, I liked this film! Highly recommeneded.
0, Don't waste your time - a real dud
1, Good film. Great acting.
0, This was a waste of talent.
1, Great movie. Good acting.

Then I wrote a function to make a Vocabulary object from the reviews data:

# make_vocab.py
import torchtext as tt  # v0.9
import collections

def make_vocab(fn):
  toker = tt.data.utils.get_tokenizer("basic_english")
  counter_obj = collections.Counter()
  f = open(fn, "r")
  for line in f:
    line = line.strip()
    txt = line.split(",", 1)[1]  # text after the label; maxsplit=1 guards against commas in a review
    split_and_lowered = toker(txt)
    counter_obj.update(split_and_lowered)
  f.close()
  result = tt.vocab.Vocab(counter_obj, min_freq=1)
  return result

def main():
  print("\nBegin make PyTorch Vocab demo ")

  fn = ".\\Data\\reviews.txt"
  vocab = make_vocab(fn)
  print("\nVocab itos: ")
  for i in range(len(vocab)):
    print(i, vocab.itos[i])
 
  print("\nEnd demo ")

if __name__ == "__main__":
  main()

There is a lot going on here. The torchtext library has several built-in tokenizers. I used “basic_english” which splits words based on blank spaces and converts words to lower case. The topic of tokenizers is huge, but that’s for another post.

Python’s collections module has a convenient Counter object that keeps track of how many times each word occurs. The torchtext Vocab object accepts a Counter object.
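A quick stdlib-only illustration of what the Counter accumulates as the demo loops over reviews; each `update()` call adds one tokenized review’s counts to the running tally:

```python
# counter_demo.py
# Counter tallies token occurrences across all reviews.
import collections

counter_obj = collections.Counter()
counter_obj.update(["this", "was", "a", "bad", "movie", "."])
counter_obj.update(["great", "movie", ".", "good", "acting", "."])

# most_common() returns (token, count) pairs, highest count first
print(counter_obj.most_common(2))
# [('.', 3), ('movie', 2)]
```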

Without the built-in Vocab, Counter, and Tokenizer objects, the demo code would be about four times as long (I’ve implemented custom Vocab objects from scratch in the past and it’s a lot of work).

After the Vocabulary object was created, the demo code walked through each integer index value and displayed its associated word/token:

0 &lt;unk&gt;
1 &lt;pad&gt;
2 .
3 a
4 this
. . .
26 time
27 your

Index values 0 and 1 are reserved by default: index 0 represents an unknown word, and index 1 is used for padding in problems that require all input sentences to have the same length. Padding is yet another big topic.
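A rough stdlib-only sketch of the kind of mapping a Vocab object builds (this is my own illustration, not the torchtext source): reserve indices 0 and 1 for the special tokens, assign the rest by descending frequency, and make unknown-word lookups fall through to index 0.

```python
# vocab_sketch.py
# Sketch of building itos (index-to-string) and stoi (string-to-index)
# mappings from a Counter, in the style of a torchtext Vocab.
import collections

def make_itos_stoi(counter_obj):
  itos = ["<unk>", "<pad>"]  # reserved indices 0 and 1
  # remaining tokens ordered by descending frequency, ties alphabetical
  for tok, _ in sorted(counter_obj.items(), key=lambda kv: (-kv[1], kv[0])):
    itos.append(tok)
  # defaultdict(int) makes any unknown word map to index 0
  stoi = collections.defaultdict(int, {tok: i for i, tok in enumerate(itos)})
  return itos, stoi

itos, stoi = make_itos_stoi(collections.Counter(["good", "movie", "movie"]))
print(itos)           # ['<unk>', '<pad>', 'movie', 'good']
print(stoi["zebra"])  # 0  (unknown word)
```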

The period character was the most common token in the source movie reviews, occurring 7 times. The word “a” was the second most common, occurring 3 times. And so on. Mapping words/tokens by frequency allows you to filter out very rare words. Notice that I misspelled the word “recommeneded”, which would end up being a rare word.

Anyway, good fun, and an interesting exploration. I have many more little experiments to do before I get a full understanding of all the new Torchtext functions.


Converting words to index values is one thing. Converting a woman to a creature is something else. Here are three movies that wouldn’t get good reviews but they’re a lot of fun. Left: “The Wasp Woman” (1959). The woman founder of a cosmetics company is aging but finds she can rejuvenate her looks with wasp serum. It has predictable side effects. Center: “The Leech Woman” (1960). A middle-aged woman discovers a secret African potion that makes her look young again — for a few weeks. Unfortunately it requires the pineal gland of a man. Right: “The Snake Woman” (1961). A doctor gives his sick wife snake venom to cure her. It works but their daughter Atheris turns into a deadly cobra every now and then.


# make_vocab.py

# import numpy as np
# import torch as T
import torchtext as tt  # v0.9
import collections

# data file looks like:
# 0, This was a BAD movie.
# 1, I liked this film! Highly recommeneded.
# 0, Don't waste your time - a real dud
# 1, Good film. Great acting.
# 0, This was a waste of talent.
# 1, Great movie. Good acting.

def make_vocab(fn):
  toker = tt.data.utils.get_tokenizer("basic_english")
  counter_obj = collections.Counter()
  f = open(fn, "r")
  for line in f:
    line = line.strip()
    # print(line)
    txt = line.split(",", 1)[1]  # text after the label; maxsplit=1 guards against commas in a review
    split_and_lowered = toker(txt)
    counter_obj.update(split_and_lowered)
  f.close()
  result = tt.vocab.Vocab(counter_obj, min_freq=1)
  return result

def main():
  print("\nBegin make PyTorch Vocab demo ")

  fn = ".\\Data\\reviews.txt"
  vocab = make_vocab(fn)
  print("\nVocab itos: ")
  for i in range(len(vocab)):
    print(i, vocab.itos[i])
 
  print("\nEnd demo ")

if __name__ == "__main__":
  main()