I Like the New TorchText v0.9 Dataset Interface

Version 0.9 of the PyTorch TorchText library was released a few days ago. The TorchText library has several built-in datasets for use with text and natural language processing experiments. The v0.9 interface is completely different from v0.8 and earlier. The new interface is much improved in my opinion.

I installed TorchText 0.9 for my current Python 3.7.6 CPU Windows system from download.pytorch.org/whl/torch_stable.html. Then I slowly but surely constructed a demo program for the IMDB movie review dataset. The IMDB dataset has 25,000 training and 25,000 test movie reviews. Each review is labeled "positive" or "negative". My demo program 1.) loads the IMDB data into memory (downloading it first if necessary), 2.) counts the occurrences of each word (after converting each word to lower case), 3.) builds a Vocabulary object from the counts.

The statements to load IMDB into memory are:

  print_w_time("\nLoading TorchText 0.9 IMDB train and test ")
  train_ds, test_ds = tt.datasets.IMDB()
  print_w_time("Data has been loaded ")

The default is to load both the train and test datasets, but you can specify just train or test in the constructor if you want. I used a program-defined print_w_time() function so I could see how long each task took. The first download takes about 5 minutes, but subsequent loads into memory from the cache take about 30 seconds.

Next, the demo walks through the training data, breaking each review into separate words, converting each word to lower case, and accumulating the counts into a Python Counter object:

  print_w_time("\nComputing counts for each word in train data ")
  toker = tt.data.utils.get_tokenizer("basic_english")
  counts = collections.Counter()
  for (label, line) in train_ds:
    counts.update(toker(line))
  print_w_time("All word counts determined ")
  n = len(counts)
  print("Found " + str(n) + " distinct words \n")

  print("the: " + str(counts["the"]))      # 335_746
  print("The: " + str(counts["The"]))      # 0
  print("fzq: " + str(counts["fzq"]))      # 0 (not error)

The upcoming Vocab object requires a Counter object. A Counter object is a specialized dict object that returns 0 for an item not in the collection rather than throwing a KeyError exception. This dependency is mildly ugly in my opinion, but sometimes it's better to take on a dependency than to implement a duplicate class. There are 335,746 instances of the word "the" in the training data.
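The Counter behavior the demo relies on is easy to see with a tiny stand-alone snippet. This is plain Python, no TorchText needed:

```python
import collections

# a Counter returns 0 for a missing key instead of raising KeyError
counts = collections.Counter()
counts.update(["the", "movie", "the"])  # tally a tiny token list

print(counts["the"])   # 2
print(counts["fzq"])   # 0 -- no exception

# a plain dict throws KeyError for the same lookup
d = dict(counts)
try:
  print(d["fzq"])
except KeyError:
  print("KeyError from plain dict")
```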

My demo concludes by using the Counter object to create a Vocab object:

  print_w_time("\nCreating Vocab object min_freq=5 ")
  vocab = tt.vocab.Vocab(counts, min_freq=5)
  print_w_time("Vocabulary object created ")
  print("")

  print("the: " + str(vocab.stoi["the"]))  # 2
  print("0  : " + str(vocab.itos[0]))      # unk
  print("1  : " + str(vocab.itos[1]))      # pad
  print("2  : " + str(vocab.itos[2]))      # the
  print("99 : " + str(vocab.itos[99]))     # its

The min_freq parameter is used to filter out very rare words, including accidental misspellings. The itos ("integer-to-string") lookup accepts an integer ID and returns the associated word. The stoi ("string-to-integer") lookup accepts a word and returns the associated integer ID. By default, token 0 is reserved for "unk" (words that don't appear in the vocabulary) and token 1 for "pad" (dummy padding used when a set of reviews must all have the same number of words).
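The stoi/itos idea can be mimicked in a few lines of plain Python. This is just an illustration of the concept, not the TorchText implementation; the make_vocab() helper name and the exact "<unk>"/"<pad>" spellings are my own choices:

```python
import collections

def make_vocab(counts, min_freq):
  # itos: list where the index is the ID; IDs 0 and 1 are reserved
  itos = ["<unk>", "<pad>"]
  for word, freq in counts.most_common():  # most frequent words first
    if freq >= min_freq:
      itos.append(word)
  # stoi: dict mapping word to ID; unknown words map to 0 (<unk>)
  stoi = collections.defaultdict(int)
  for idx, word in enumerate(itos):
    stoi[word] = idx
  return stoi, itos

counts = collections.Counter({"the": 9, "movie": 4, "zzyzx": 1})
stoi, itos = make_vocab(counts, min_freq=2)
print(stoi["the"])    # 2
print(itos[2])        # the
print(stoi["zzyzx"])  # 0 -- too rare, maps to <unk>
```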

My next experiment will explore using a TorchText IMDB Dataset object with a PyTorch DataLoader object. Prior to v0.9, a TorchText Dataset object was not compatible with a PyTorch DataLoader. This new Dataset-DataLoader compatibility is the primary motivation for the creation of the new 0.9 version of TorchText.
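To get a feel for what a DataLoader collate function must do with variable-length reviews, here is a plain-Python sketch of the padding step (no PyTorch required; pad_batch() is a hypothetical helper, and the pad ID of 1 matches the "pad" token position described above):

```python
def pad_batch(batch_ids, pad_id=1):
  # pad each list of token IDs to the length of the longest one
  max_len = max(len(ids) for ids in batch_ids)
  return [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]

batch = [[2, 7, 9], [2, 5], [8]]  # three "reviews" as token IDs
padded = pad_batch(batch)
print(padded)  # [[2, 7, 9], [2, 5, 1], [8, 1, 1]]
```

A real collate_fn would do this padding and then convert the result to a tensor before handing it to the model.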



Yousuke Ozawa spent hours looking at Google Maps of New York City to construct this alphabet made from images of buildings viewed from above.


# new_torchtext_imdb.py
# much of this is new in TorchText 0.9

import torchtext as tt
import collections
import time

def print_w_time(msg):  # avoid shadowing built-in str
  print(msg + "  ", end="")
  # dt = time.strftime("%Y_%m_%d-%H_%M_%S")
  dt = time.strftime("%I_%M_%S_%p")
  print(dt)

def main():
  print("\nBegin demo of new TorchText interface for IMDB ")

  print_w_time("\nLoading TorchText 0.9 IMDB train and test ")
  train_ds, test_ds = tt.datasets.IMDB()
  print_w_time("Data has been loaded ")

  print_w_time("\nComputing counts for each word in train data ")
  toker = tt.data.utils.get_tokenizer("basic_english")
  counts = collections.Counter()
  for (label, line) in train_ds:
    counts.update(toker(line))
  print_w_time("All word counts determined ")
  n = len(counts)
  print("Found " + str(n) + " distinct words \n")

  print("the: " + str(counts["the"]))      # 335_746
  print("The: " + str(counts["The"]))      # 0
  print("fzq: " + str(counts["fzq"]))      # 0 (not error)

  print_w_time("\nCreating Vocab object min_freq=5 ")
  vocab = tt.vocab.Vocab(counts, min_freq=5)
  print_w_time("Vocabulary object created ")
  print("")

  print("the: " + str(vocab.stoi["the"]))  # 2
  print("0  : " + str(vocab.itos[0]))      # unk
  print("1  : " + str(vocab.itos[1]))      # pad
  print("2  : " + str(vocab.itos[2]))      # the
  print("99 : " + str(vocab.itos[99]))     # its

  print("\nEnd demo \n")

if __name__ == "__main__":
  main()
