IMDB Movie Review Sentiment Analysis Using an LSTM with PyTorch

When I was first learning PyTorch, I implemented a demo of the IMDB movie review sentiment analysis problem using an LSTM. I recently revisited that code to incorporate all the things I learned about PyTorch since that early example.

My overall approach is to preprocess the IMDB data by encoding each word as an integer ID, rather than encoding on the fly during training. IDs are assigned by frequency, so the smallest ID values correspond to the most common words. This makes it easy to filter out rare words like “floozle”. Preparing the raw movie data is the most difficult part of creating the sentiment analysis system.

I created a root directory named IMDB with subdirectories Data and Models. I downloaded the 50,000 movie reviews from https://ai.stanford.edu/~amaas/data/sentiment/ as aclImdb_v1.tar.gz to the root IMDB directory, then unzipped using the 7-Zip utility to get file aclImdb_v1.tar, and then I unzipped that file to get an aclImdb directory that contains all the movie reviews. I moved that directory and its contents into the Data directory.
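If you prefer to avoid 7-Zip, the archive can also be extracted with Python's standard tarfile module. This is a minimal sketch, not part of the demo; it assumes the .tar.gz file sits in the root IMDB directory and the script is run from there.

import tarfile

# extract aclImdb_v1.tar.gz (both the gzip and tar steps) into .\Data
with tarfile.open(".\\aclImdb_v1.tar.gz", "r:gz") as tar:
  tar.extractall(".\\Data")   # creates .\Data\aclImdb\...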


Here I illustrate the data preprocessing for tiny reviews that are 20 words or less. Notice there is a duplicate review.

The goal of my preprocessing is to create files imdb_train_50w.txt and imdb_test_50w.txt for training and testing respectively. These files hold only very short movie reviews, those with 50 words or less, because working with the entire dataset of reviews is very difficult. This filtering generated just 620 training items/reviews, which is too few to get good results. In a non-demo NLP scenario you need several thousand training items.

The words in the reviews are tokenized into integer values like “the” = 4 and “movie” = 20. I reserved 0 for (PAD) to pad all reviews to exactly 50 words. Most punctuation is stripped out and all words are converted to lower case. Each line in the train and test files is one review where padding is at the beginning and the class label to predict (0 = negative, 1 = positive) is the last value on each line. This preprocessing script is complicated and took me several days of coding and debugging. See the code below.
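For example, a hypothetical four-word review of “the movie was great” would be stored as a single line consisting of 46 padding 0s, then the word IDs 4 20 16 86, and finally the class label 1, all separated by single spaces.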

The program to create and train a sentiment analysis model using a PyTorch LSTM also took several days of work. The model definition is:

import numpy as np
import torch as T
device = T.device('cpu')

class LSTM_Net(T.nn.Module):
  def __init__(self):
    # vocab_size = 129892
    super(LSTM_Net, self).__init__()
    self.embed = T.nn.Embedding(129892, 32)
    self.lstm = T.nn.LSTM(32, 75)
    self.drop = T.nn.Dropout(0.10)
    self.fc1 = T.nn.Linear(75, 10)
    self.fc2 = T.nn.Linear(10, 2)  # 0=neg, 1=pos

  def forward(self, x):
    # x = review/sentence. length = 50 (fixed w/ padding)
    z = self.embed(x) 
    z = z.view(50, 1, 32)  # "seq batch input"
    lstm_oupt, (h_n, c_n) = self.lstm(z)
    z = lstm_oupt[-1]
    z = self.drop(z)
    z = T.tanh(self.fc1(z)) 
    z = self.fc2(z)  # CrossEntropyLoss will apply softmax
    return z  

There are virtually unlimited design choices for an LSTM-based network. There are no good rules of thumb for design — it’s all trial and error guided by experience.

The make_data_files.py data preprocessing program determined that there are 129,892 distinct words/tokens in the entire training data. This is far too many words to get good results, so in a non-demo scenario I'd filter the vocabulary down to just the 10 or 20 thousand most common words/tokens.
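Here is a minimal sketch of that kind of vocabulary filtering, not part of the demo. It assumes the vocab_dict produced by make_vocab() (where the value is a word's 1-based frequency rank) and a hypothetical cap named MAX_VOCAB.

MAX_VOCAB = 20000  # hypothetical cap on vocabulary size

def word_to_index(word, vocab_dict, offset=3):
  # map a word to its encoded file index; rare or unknown words become OOV
  rank = vocab_dict.get(word)    # 1-based rank, 1 = most frequent ('the')
  if rank is None or rank > MAX_VOCAB:
    return 2                     # out-of-vocabulary index
  return rank + offset           # 'the' = 1 + 3 = 4, as in the Keras format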

Each word ID in an input review is converted into an embedding vector of 32 values (in a non-demo scenario, an embedding size of 100 values is more common). The LSTM component converts these to 75 values. These 75 values are passed to two Linear layers that map down to 10 values and then down to 2 output logit values, corresponding to class 0 (negative) and class 1 (positive).

For simplicity, during training I used a batch size of 1, meaning I processed just one review at a time. In a non-demo scenario, I’d probably use a batch size of 16.
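Batching would require a small change to the forward() method, because the demo version hard-codes a batch size of 1 in the view() call. Here is a minimal sketch of a batch-capable forward(), assuming the input x is a batch of padded reviews with shape [bat_size, 50]; this is not the demo code.

  def forward(self, x):
    # x = batch of padded reviews, shape [bat_size, 50]
    z = self.embed(x)          # [bat_size, 50, 32]
    z = z.permute(1, 0, 2)     # [50, bat_size, 32] = "seq, batch, input"
    lstm_oupt, (h_n, c_n) = self.lstm(z)
    z = lstm_oupt[-1]          # [bat_size, 75] -- last time step
    z = self.drop(z)
    z = T.tanh(self.fc1(z))
    z = self.fc2(z)            # [bat_size, 2] logits
    return z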

After the model was trained, I fed a movie review of “the movie was a great waste of my time” to the model. I converted each word manually: “the” = 4, “movie” = 20, “was” = 16, etc., by using the vocab_dict dictionary in the make_data_files.py program. In a non-demo scenario, I would have programmatically determined the ID values for each word by using the vocab_file.txt file that was generated by the data preparation program.
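A minimal sketch of that programmatic lookup is below. It assumes vocab_file.txt contains word-space-rank lines as written by make_data_files.py; the helper name encode_review and the file path are hypothetical.

def encode_review(text, vocab_path=".\\vocab_file.txt", max_len=50):
  # build a word -> 1-based rank dictionary from the vocabulary file
  vocab = {}
  with open(vocab_path, "r", encoding="utf8") as f:
    for line in f:
      (w, rank) = line.split()
      vocab[w] = int(rank)
  # rank + 3 = encoded file index; unknown words map to OOV index 2
  ids = [vocab[w] + 3 if w in vocab else 2 for w in text.lower().split()]
  ids = [0] * (max_len - len(ids)) + ids   # pre-pend 0s; assumes <= max_len words
  return np.array(ids, dtype=np.int64)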

The prediction result for the review was [0.9984, 0.0016] which maps to class 0, which is a negative review.
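In code, that mapping is just an argmax over the two pseudo-probabilities (a trivial sketch, not part of the demo program):

import numpy as np
probs = np.array([0.9984, 0.0016])   # model output for the review
label = "negative" if np.argmax(probs) == 0 else "positive"
print(label)   # negative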

Whew! Natural language processing problems like movie review analysis are mysterious and very, very difficult. But very, very interesting.



Three mystery movies that I give positive sentiment reviews to. Left: “Murder on the Orient Express” (1974). Center: “Sherlock Holmes and the House of Fear” (1945). Right: “The Nice Guys” (2016).


The complete demo code for make_data_files.py is below.

# make_data_files.py
#
# input: the source Stanford 50,000 movie review data files
# output: one combined train file, one combined test file
# output files are index-encoded, using the Keras dataset
# format where 0 = padding, 1 = 'start', 2 = OOV, 3 = unused,
# 4 = most frequent word ('the'), 5 = next most frequent, etc.
# i'm skipping the start=1 token because it makes no sense here.
# these data files will be loaded into memory and then fed to
# a built-in Embedding layer (rather than custom embeddings)

# only reviews with max_review_len words or less are kept
# (20 in this demo run; 50 for the files used by imdb_lstm.py).
# short reviews have 0s pre-pended. the class
# label (0 or 1) is the very last value on each line.

import os

# allow the Windows cmd shell to deal with wacky characters
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)

# -------------------------------------------------------------

def get_reviews(dir_path, num_reviews, punc_str):
  punc_table = {ord(char): None for char in punc_str}  # dict
  reviews = []  # list-of-lists of words
  ctr = 1
  for file in os.listdir(dir_path):
    if ctr "gt" num_reviews: break
    curr_file = os.path.join(dir_path, file)
    f = open(curr_file, "r", encoding="utf8") 
    for line in f:
      line = line.strip()
      if len(line) "gt" 0:  # number characters
        # print(line)  # to show non-ASCII == errors
        line = line.translate(punc_table)  # remove punc
        line = line.lower()  # lower case
        line = " ".join(line.split())  # remove consecutive WS
        word_list = line.split(" ")  # list of words
        reviews.append(word_list)    # 
    f.close()  # close curr file
    ctr += 1
  return reviews

# -------------------------------------------------------------

def make_vocab(all_reviews):
  word_freq_dict = {}   # key = word, value = frequency

  for i in range(len(all_reviews)):
    reviews = all_reviews[i]
    for review in reviews:
      for word in review:
        if word in word_freq_dict:
          word_freq_dict[word] += 1
        else:
          word_freq_dict[word] = 1

  kv_list = []  # list of word-freq tuples so can sort
  for (k,v) in word_freq_dict.items():
    kv_list.append((k,v))

  # list of tuples index is 0-based rank, val is (word,freq)
  sorted_kv_list = \
    sorted(kv_list, key=lambda x: x[1], \
      reverse=True)  # sort by freq

  f = open(".\\vocab_file.txt", "w", encoding="utf8")
  vocab_dict = {}  
  # key = word, value = 1-based rank 
  # ('the' = 1, 'a' = 2, etc.)
  for i in range(len(sorted_kv_list)):
    w = sorted_kv_list[i][0]  # word is at [0]
    vocab_dict[w] = i+1       # 1-based as in Keras dataset

    f.write(w + " " + str(i+1) + "\n")  # word-space-index
  f.close()

  return vocab_dict

# -------------------------------------------------------------

def generate_file(reviews_lists, outpt_file, w_or_a, 
  vocab_dict, max_review_len, label_char):

  # write first time, append later
  fout = open(outpt_file, w_or_a, encoding="utf8")  
  offset = 3  # vocab rank + 3 = Keras-style index ('the' = 1 + 3 = 4)
      
  for i in range(len(reviews_lists)):  # walk each review
    curr_review = reviews_lists[i]
    n_words = len(curr_review)     
    if n_words > max_review_len:
      continue  # next i, continue without writing anything

    n_pad = max_review_len - n_words   # num of 0s to pre-pend

    for j in range(n_pad):  # write padding to get 50 values
      fout.write("0 ")
    
    for word in curr_review: 
      # a word in test set might not have been in training set
      if word not in vocab_dict:  
        fout.write("2 ")   # 2 is out-of-vocab index        
      else:
        idx = vocab_dict[word] + offset
        fout.write("%d " % idx)
    
    fout.write(label_char + "\n")  # add label '0' or '1'
        
  fout.close()

# -------------------------------------------------------------

def main():
  remove_chars = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~" 
  # leave ' for words like it's  

  print("\nLoading all reviews into memory - be patient ")
  pos_train_reviews = get_reviews(".\\aclImdb\\train\\pos", 
    12500, remove_chars)
  neg_train_reviews = get_reviews(".\\aclImdb\\train\\neg",
    12500, remove_chars)
  pos_test_reviews = get_reviews(".\\aclImdb\\test\\pos",
    12500, remove_chars)
  neg_test_reviews = get_reviews(".\\aclImdb\\test\\neg",
    12500, remove_chars)

  # mp = max(len(l) for l in pos_train_reviews)  # 2469
  # mn = max(len(l) for l in neg_train_reviews)  # 1520
  # mm = max(mp, mn)  # longest review is 2469
  # print(mp, mn)

# -------------------------------------------------------------

  print("\nAnalyzing reviews and making vocabulary ")
  vocab_dict = make_vocab([pos_train_reviews, 
    neg_train_reviews])  # key = word, value = word rank
  v_len = len(vocab_dict)  
  # need this value, plus 4, for Embedding: 129888+4 = 129892
  print("\nVocab size = %d -- use this +4 for \
    Embedding nw " % v_len)

  max_review_len = 20   # 20 for this demo run; set to 50 (and rename
                        # the output files) to make the 50-word files
  # if max_review_len == None or max_review_len > mm:
  #   max_review_len = mm  # use all reviews (any length)

  print("\nGenerating training file len %d words or less " \
    % max_review_len)

  generate_file(pos_train_reviews, ".\\imdb_train_20w.txt", 
    "w", vocab_dict, max_review_len, "1")
  generate_file(neg_train_reviews, ".\\imdb_train_20w.txt",
    "a", vocab_dict, max_review_len, "0")

  print("Generating test file with len %d words or less " \
    % max_review_len)

  generate_file(pos_test_reviews, ".\\imdb_test_20w.txt", 
    "w", vocab_dict, max_review_len, "1")
  generate_file(neg_test_reviews, ".\\imdb_test_20w.txt", 
    "a", vocab_dict, max_review_len, "0")

  # inspect a generated file
  # vocab_dict was used indirectly (offset)

  print("\nDisplaying encoded training file: \n")
  f = open(".\\imdb_train_20w.txt", "r", encoding="utf8")
  for line in f: 
    print(line, end="")
  f.close()

# -------------------------------------------------------------

  print("\nDisplaying decoded training file: \n") 

  index_to_word = {}
  index_to_word[0] = "<PAD>"
  index_to_word[1] = "<ST>"
  index_to_word[2] = "<OOV>"
  for (k,v) in vocab_dict.items():
    index_to_word[v+3] = k

  f = open(".\\imdb_train_20w.txt", "r", encoding="utf8")
  for line in f:
    line = line.strip()
    indexes = line.split(" ")
    for i in range(len(indexes)-1):  # last is '0' or '1'
      idx = int(indexes[i])
      w = index_to_word[idx]
      print("%s " % w, end="")
    print("%s " % indexes[len(indexes)-1])
  f.close()

if __name__ == "__main__":
  main()

The complete demo code for imdb_lstm.py is below.

# imdb_lstm.py

# PyTorch 1.9.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10 

import numpy as np
import torch as T
device = T.device('cpu')

# -----------------------------------------------------------

class LSTM_Net(T.nn.Module):
  def __init__(self):
    # vocab_size = 129892
    super(LSTM_Net, self).__init__()
    self.embed = T.nn.Embedding(129892, 32)
    self.lstm = T.nn.LSTM(32, 75)
    self.drop = T.nn.Dropout(0.10)
    self.fc1 = T.nn.Linear(75, 10)  
    self.fc2 = T.nn.Linear(10, 2)  # 0=neg, 1=pos

  def forward(self, x):
    # x = review/sentence. length = 50 (fixed w/ padding)
    z = self.embed(x) 
    z = z.view(50, 1, 32)  # "seq batch input"
    lstm_oupt, (h_n, c_n) = self.lstm(z)
    z = lstm_oupt[-1]
    z = self.drop(z)
    z = T.tanh(self.fc1(z)) 
    z = self.fc2(z)  # CrossEntropyLoss will apply softmax
    return z  

# -----------------------------------------------------------

def accuracy(model, data_x, data_y):
  # data_x and data_y are lists of tensors
  model.eval()
  num_correct = 0; num_wrong = 0
  for i in range(len(data_x)):
    X = data_x[i]
    Y = data_y[i].reshape(1)
    with T.no_grad():
      oupt = model(X) 

    idx = T.argmax(oupt.data)
    if idx == Y:  # predicted == target
      num_correct += 1
    else:
      num_wrong += 1
  acc = (num_correct * 100.0) / (num_correct + num_wrong)
  model.train()  # restore training mode
  return acc

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin PyTorch IMDB LSTM demo ")
  print("Using only reviews with 50 or less words ")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. load data from file
  print("\nLoading preprocessed train and test data ")
  max_review_len = 50 # exact review length
  
  train_xy = np.loadtxt(".\\Data\\imdb_train_50w.txt", 
    delimiter=" ",  usecols=range(0,51), dtype=np.int64)
  train_x = train_xy[:,0:50]
  train_y = train_xy[:,50]

  test_xy = np.loadtxt(".\\Data\\imdb_test_50w.txt", 
    delimiter=" ",  usecols=range(0,51), dtype=np.int64)
  test_x = test_xy[:,0:50]
  test_y = test_xy[:,50]
 
  # 1b. convert to tensors
  train_x = T.tensor(train_x, dtype=T.int64).to(device)
  train_y = T.tensor(train_y, dtype=T.int64).to(device)
  test_x = T.tensor(test_x, dtype=T.int64).to(device)
  test_y = T.tensor(test_y, dtype=T.int64).to(device)

  N = len(train_x)
  print("Data loaded. Number train items = %d " % N)

# -----------------------------------------------------------

  # 2. create network
  net = LSTM_Net().to(device)

  # 3. train model
  loss_func = T.nn.CrossEntropyLoss()  # does log-softmax()
  optimizer = T.optim.Adam(net.parameters(), lr=1.0e-3)
  max_epochs = 12
  log_interval = 2  # display progress

  print("\nStarting training with bat_size = 1")
  for epoch in range(0, max_epochs):
    net.train()  # set training mode
    indices = np.arange(N)
    np.random.shuffle(indices)
    tot_err = 0.0

    for i in range(N):  # one review at a time
      j = indices[i]
      X = train_x[j]
      Y = train_y[j].reshape(1)
      
      optimizer.zero_grad()
      oupt = net(X)  
      loss_val = loss_func(oupt, Y) 
      tot_err += loss_val.item()
      loss_val.backward()  # compute gradients
      optimizer.step()     # update weights

    if epoch % log_interval == 0:
      print("epoch = %4d  |" % epoch, end="")
      print("  avg loss = %7.4f  |" % (tot_err / N), end="")
      train_acc = accuracy(net, train_x, train_y)
      print("  accuracy = %7.2f%%" % train_acc)
      # test_acc = accuracy(net, test_x, test_y)  # 
      # print("  test accuracy = %7.2f%%" % test_acc)
  print("Training complete")

# -----------------------------------------------------------

  # 4. evaluate model
  test_acc = accuracy(net, test_x, test_y)
  print("\nAccuracy on test data = %7.2f%%" % test_acc)

  # 5. save model
  print("\nSaving trained model state")
  fn = ".\\Models\\imdb_model.pt"
  T.save(net.state_dict(), fn)

  # saved_model = LSTM_Net()
  # saved_model.load_state_dict(T.load(fn))
  # use saved_model to make prediction(s)

  # 6. use model
  print("\nFor \"the movie was a great waste of my time\"")
  print("0 = negative, 1 = positive ")
  review = np.array([4, 20, 16, 6, 86, 425, 7, 58, 64], \
    dtype=np.int64)
  padding = np.zeros(41, dtype=np.int64)
  review = np.concatenate([padding, review])  # pre-pend 0s as in training data
  review = T.tensor(review, dtype=T.int64)
  
  net.eval()
  with T.no_grad():
    prediction = net(review)  # raw outputs
  print("\nlogits: ", end=""); print(prediction) 
  probs = T.softmax(prediction, dim=1)  # pseudo-probabilities
  probs = probs.numpy()
  print("pseudo-probs: ", end="")
  print("%0.4f %0.4f " % (probs[0][0], probs[0][1]))

  print("\nEnd PyTorch IMDB LSTM sentiment demo")

if __name__ == "__main__":
  main()