Working with IMDB Movie Review Data Vocabulary Collections

I’ve been working with LSTM (long short-term memory) and TA (transformer architecture) systems for natural language processing (NLP) recently. NLP problems are very difficult. My standard experiments use the IMDB movie review dataset.

During the process of creating IMDB movie review training and test data files from the raw source data, I created a file that contains vocabulary information. The file looks like:

the      1
and      2
a        3
of       4
. . .
whelkbr  129888

The data is 1-based and is ordered by frequency of the word in the source data. The source IMDB movie review data has 129,888 distinct words, based on how I tokenized: split words/tokens on the blank space character, convert all words to lower case, and remove all punctuation except for the single-quote character. The last several hundred words are the rarest and are mostly misspellings.
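A minimal sketch of the tokenization and frequency ranking described above might look like the following. The function names tokenize() and build_vocab_lines() are hypothetical, the two-review input is made up, and this is not necessarily the code used to produce the actual vocabulary file. The sketch removes punctuation before splitting so that punctuation-only tokens disappear instead of becoming empty strings.

```python
import string
from collections import Counter

# every ASCII punctuation character except the single quote
PUNCT_TO_REMOVE = string.punctuation.replace("'", "")
STRIP_TABLE = str.maketrans("", "", PUNCT_TO_REMOVE)

def tokenize(text):
  # lower-case, strip punctuation (keeping '), split on whitespace
  return text.lower().translate(STRIP_TABLE).split()

def build_vocab_lines(reviews):
  # count every token, then rank 1-based by descending frequency
  counts = Counter()
  for r in reviews:
    counts.update(tokenize(r))
  ranked = counts.most_common()  # ties keep first-seen order
  return [word + " " + str(rank)
    for (rank, (word, _)) in enumerate(ranked, start=1)]

# hypothetical two-review corpus, just for illustration
lines = build_vocab_lines(["The movie was great!", "the movie"])
for line in lines:
  print(line)
```

On a real corpus the rare tail of the ranking is where misspellings and markup fragments end up.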

Code to create an in-memory vocabulary dictionary using just the 1,000 most common words is:

  vocab_file = ".\\Data\\vocab_file.txt"
  vocab = dict()  # key = word/token, value = integer ID
  max_words = 1000  # keep only the most common words
  # vocab file is 1-based where "the" = 1, "and" = 2, etc.
  with open(vocab_file, 'r', encoding='utf-8') as f:
    for (i, line) in enumerate(f):
      if i == max_words: break  # stop after 1,000 lines
      word_id = line.split()  # split on any run of whitespace
      word = word_id[0]; id = int(word_id[1])
      vocab[word] = id + 3  # IDs 0,1,2,3 are reserved
  # vocab dict is offset by 3: "the" = 4, "and" = 5, etc.

The in-memory vocabulary object is offset by 3 to allow special IDs of 0 (padding), 1 (start-of-sequence), 2 (out-of-vocabulary), and 3 (unused).
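To make the offset and the reserved IDs harder to get wrong, the loading logic can be wrapped in a function that uses named constants. This is only a sketch: the names PAD_ID, SOS_ID, OOV_ID, OFFSET, and load_vocab() are my own for illustration, and the function reads from any iterable of lines so it can be tried without the real file.

```python
PAD_ID, SOS_ID, OOV_ID = 0, 1, 2  # ID 3 is reserved but unused
OFFSET = 3  # file ID 1 ("the") becomes in-memory ID 4

def load_vocab(lines, max_words=1000):
  # lines: iterable of "word  id" strings, ordered by frequency
  vocab = dict()
  for line in lines:
    parts = line.split()
    if len(parts) != 2:
      continue  # skip blank or malformed lines
    word = parts[0]; file_id = int(parts[1])
    if file_id > max_words:
      break  # everything after this is rarer
    vocab[word] = file_id + OFFSET
  return vocab

sample_lines = ["the      1", "and      2", "a        3", "of       4"]
vocab = load_vocab(sample_lines, max_words=3)
print(vocab)  # {'the': 4, 'and': 5, 'a': 6}
```

With the real file, the lines argument would be the open file object.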

Example code to convert a movie review from words/tokens to IDs is:

  review = "the movie was a great waste of my time"
  print("Review = ")
  print(review)
  review_ids = []
  review_words = review.split(" ")
  for w in review_words:
    if w not in vocab:
      id = 2  # out-of-vocab
    else:
      id = vocab[w]
    review_ids.append(id)
  print("Review IDs = ")
  print(review_ids)

This code is needed when you have trained an LSTM or TA model and want to feed it a new, previously unseen movie review.
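The same word-to-ID conversion can be written more compactly with dict.get(), which returns a default value instead of raising KeyError when a word is missing. The sketch below is one possible version. The function name encode_review() and the toy vocabulary IDs are hypothetical, and the review is lower-cased on the assumption that the vocabulary was built from lower-cased text.

```python
OOV_ID = 2  # out-of-vocabulary

def encode_review(review, vocab, oov_id=OOV_ID):
  # unknown words map to oov_id rather than raising KeyError
  return [vocab.get(w, oov_id) for w in review.lower().split()]

# toy vocabulary with made-up IDs, just for illustration
toy_vocab = {"the": 4, "a": 6, "movie": 20, "was": 16}
ids = encode_review("The movie was a great waste", toy_vocab)
print(ids)  # [4, 20, 16, 6, 2, 2]
```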

Example code to create a reverse vocabulary that accepts a word ID and returns the corresponding word is:

  id_to_word = dict()  # could use list or array
  id_to_word[0] = "<PAD>"  # padding
  id_to_word[1] = "<SOS>"  # start-of-sequence
  id_to_word[2] = "<OOV>"  # out-of-vocabulary
  for (k,v) in vocab.items():
    id_to_word[v] = k  # "the" = 4, "and" = 5, etc.

The reverse vocabulary object could be used like this:

  review_ids = [13, 22, 16, 2000, 314]
  print("\nReview IDs = ")
  print(review_ids)
  print("Review as words: ")
  for id in review_ids:
    if id not in id_to_word:
      w = "(UNK)"
    else:
      w = id_to_word[id]
    print(w, end=" ")
  print("")

This code is useful during development of an LSTM or TA system to debug problems.
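For debugging, it can be handy to wrap the reverse lookup in two small helper functions. The following is a sketch only: make_reverse_vocab() and decode_ids() are hypothetical names, the labels for IDs 0, 1, and 2 are placeholders I picked, and the toy vocabulary IDs are made up.

```python
def make_reverse_vocab(vocab):
  # labels for the special IDs are assumed placeholders
  id_to_word = {0: "<PAD>", 1: "<SOS>", 2: "<OOV>"}
  for (word, id) in vocab.items():
    id_to_word[id] = word
  return id_to_word

def decode_ids(ids, id_to_word, unk="(UNK)"):
  # IDs with no entry (for example, beyond the vocab size) become unk
  return " ".join(id_to_word.get(i, unk) for i in ids)

toy_vocab = {"the": 4, "movie": 20, "was": 16}  # made-up IDs
rev = make_reverse_vocab(toy_vocab)
print(decode_ids([4, 20, 16, 2000, 2], rev))
# the movie was (UNK) <OOV>
```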

Working with vocabulary dictionary collections for NLP is not conceptually difficult, but it’s easy to make mistakes related to the offset, or when dealing with error conditions such as trying to access a dictionary key that isn’t in the dictionary.
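One way to catch offset mistakes early is a small sanity check that runs right after the vocabulary is loaded. The function below is a hypothetical sketch, not part of any library. It assumes reserved IDs run from 0 through the offset and reports any word ID that lands in that range, plus any duplicated ID.

```python
def find_vocab_problems(vocab, offset=3):
  # returns a list of problem descriptions; empty list means OK
  problems = []
  ids = list(vocab.values())
  if len(ids) != len(set(ids)):
    problems.append("duplicate IDs")
  if any(i <= offset for i in ids):
    problems.append("word ID in reserved range")
  return problems

good = {"the": 4, "and": 5}  # offset applied
bad = {"the": 1, "and": 2}   # offset forgotten
print(find_vocab_problems(good))  # []
print(find_vocab_problems(bad))   # ['word ID in reserved range']
```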

Most of the NLP projects I’ve worked on required a custom vocabulary built from text specific to the problem, rather than a generic vocabulary of English words. I suspect that a vocabulary collection created from science fiction stories and novels would be quite different from a vocabulary collection created from movie reviews.

[Images] Left: Cover art by Kelly Freas (June 1958). Center: Cover art by Milton Luros (February 1954). Right: Cover art by Lawrence Stevens (September 1949).
