Reading IMDB Movie Review Dataset Files

I was working on the well-known IMDB movie review sentiment analysis problem. The goal is to create a machine learning model that accepts the text of a movie review and predicts if the review is positive (class 1) or negative (class 0).

For experimentation I created a tiny dataset with just 8 reviews: 2 training positive, 2 training negative, 2 test positive, 2 test negative. I used the same structure as the full dataset, which has 25,000 training reviews and 25,000 test reviews. Each of the train and test directories has two subdirectories named “pos” and “neg”. Each subdirectory has individual text files, one file per review.
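A directory layout like the one described above can be built with a few lines of Python. This is only a sketch: the helper name make_tiny_imdb, the file names such as "pos_0.txt", and the placeholder review text are all assumptions for illustration, not the names or contents used in the real dataset.

```python
import tempfile
from pathlib import Path

def make_tiny_imdb(root):
  # hypothetical helper: builds train/test x pos/neg with 2 files each
  root = Path(root)
  for split in ["train", "test"]:
    for label in ["pos", "neg"]:
      d = root / split / label
      d.mkdir(parents=True, exist_ok=True)
      for i in range(2):
        # placeholder text, not real IMDB reviews
        text = f"placeholder {label} review {i} ({split})"
        (d / f"{label}_{i}.txt").write_text(text, encoding="utf-8")
  return root

tmp = tempfile.mkdtemp()  # throwaway location for the demo
tiny_root = make_tiny_imdb(Path(tmp) / "aclImdb")
n_files = len(list(tiny_root.rglob("*.txt")))
print(n_files)  # 2 splits x 2 labels x 2 reviews
```

Writing into a temporary directory keeps the sketch from touching any real data directory.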

A major challenge when working with ML is reading data files into memory. I experimented with two different approaches, classic and modern. The classic technique uses the os module and its os.listdir() function. The classic technique is clear but is brittle because I hard-code directory paths using Windows “\\” separators, so the code will not work as written on Linux or macOS.
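The brittleness of the classic technique comes from the hard-coded separators, not from the os module itself. A minimal sketch of a portable variant, assuming os.path.join() is used to build the paths, is shown here. The function name read_imdb_portable and the throwaway demo files are assumptions for illustration. Because os.listdir() returns entries in arbitrary order, the sketch sorts the file names to get a repeatable result.

```python
import os
import tempfile

def read_imdb_portable(root_dir):
  # hypothetical variant of the classic approach, no hard-coded separators
  reviews = []; labels = []
  for (label_dir, label) in [("pos", 1), ("neg", 0)]:
    dir_path = os.path.join(root_dir, label_dir)
    for fname in sorted(os.listdir(dir_path)):  # listdir order is arbitrary
      full_name = os.path.join(dir_path, fname)
      with open(full_name, 'r', encoding='utf-8') as f:
        reviews.append(f.read())
      labels.append(label)
  return (reviews, labels)

# throwaway demo data: 2 "pos" files and 2 "neg" files
demo_root = tempfile.mkdtemp()
for label_dir in ["pos", "neg"]:
  os.makedirs(os.path.join(demo_root, label_dir))
  for i in range(2):
    p = os.path.join(demo_root, label_dir, str(i) + ".txt")
    with open(p, 'w', encoding='utf-8') as f:
      f.write(label_dir + " review " + str(i))

(reviews, labels) = read_imdb_portable(demo_root)
print(reviews); print(labels)
```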

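The modern technique uses the Path class from the pathlib module along with the iterdir() method. The modern technique is short and avoids hard-coded separators, but the code is a bit obscure because it relies on the overloaded “/” operator to join path parts.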
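The pieces that make the modern technique look obscure are small once they are seen in isolation. The following sketch, which uses a temporary directory and made-up file names and contents, shows the "/" operator joining path parts, iterdir() yielding Path objects rather than open file handles, and read_text() opening, reading and closing a file in one call.

```python
import tempfile
from pathlib import Path

demo_dir = Path(tempfile.mkdtemp())  # throwaway location
pos_dir = demo_dir / "pos"           # "/" joins parts; result is a Path
pos_dir.mkdir()
(pos_dir / "a.txt").write_text("great film", encoding='utf-8')
(pos_dir / "b.txt").write_text("loved it", encoding='utf-8')

# iterdir() yields Path objects in arbitrary order, so sort them
entries = sorted(p for p in pos_dir.iterdir() if p.is_file())
names = [p.name for p in entries]
texts = [p.read_text(encoding='utf-8') for p in entries]
print(names); print(texts)
```

The is_file() filter is an optional safeguard that skips any subdirectories that happen to sit next to the review files.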

The bottom line is that the modern technique is preferable in most cases.



Left: Classic space suits from “Destination Moon” (1950). Center: Neo-modern space suit from “2001: A Space Odyssey” (1968). Right: Modern space suit from “Armageddon” (1998).


Demo code:

# read_imdb_files.py

import os                 # classic
from pathlib import Path  # modern

def read_imdb_classic(root_dir):
  # note: the hard-coded "\\" separators work on Windows only
  reviews = []; labels = []
  for label_dir in ["pos", "neg"]:
    dir_path = root_dir + "\\" + label_dir  # avoids shadowing built-in dir()
    for fname in os.listdir(dir_path):
      full_name = dir_path + "\\" + fname
      with open(full_name, 'r', encoding='utf-8') as f:
        txt = f.read()
      reviews.append(txt)
      if label_dir == "pos":
        labels.append(1)
      else:
        labels.append(0)
  return (reviews, labels)

def read_imdb_modern(root_dir):
  reviews = []; labels = []
  root_dir = Path(root_dir)
  for label_dir in ["pos", "neg"]:
    # iterdir() yields Path objects, not open file handles
    for f_path in (root_dir/label_dir).iterdir():
      reviews.append(f_path.read_text(encoding='utf-8'))
      if label_dir == "pos":
        labels.append(1)
      else:
        labels.append(0)
  return (reviews, labels)

print("\nBegin reading IMDB files demo ")
  
root_dir = ".\\DataTiny\\aclImdb\\train"

print("\nReading IMDB files classic technique: \n")
(reviews, labels) = read_imdb_classic(root_dir)
print(reviews); print(labels)

print("\nReading IMDB files modern technique: \n")
(reviews, labels) = read_imdb_modern(root_dir)
print(reviews); print(labels)

print("\nEnd demo ")