Creating a Custom Python Vocabulary Object From Scratch For NLP Problems

I’ve been looking at natural language processing (NLP) problems on-and-off for quite some time. One weekend I decided to implement a custom Vocabulary class from scratch. A Vocabulary object accepts a word/token and returns a unique integer ID. For example,

vocab = make_vocab(src_text, min_freq)
idx = vocab["and"]        # calls stoi[]
print(idx)                # 7
idx = vocab.stoi["code"]  # stoi directly
wrd = vocab.itos[7]
print(wrd)                # 'and'

The TorchText library has a built-in Vocab object, but it’s very complex (so that it can handle almost any scenario), and you must still write code to parse the source text, count token frequencies, and feed the counts to the Vocab constructor.
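The parse-and-count step that has to happen before any vocabulary can be built can be sketched with the standard library alone. This is a minimal sketch, not TorchText’s API: the helper names count_tokens and simple_tokenizer, the two sample lines, and the lowercase-and-split tokenization are all assumptions made for illustration.

```python
from collections import Counter

def simple_tokenizer(text):
  # assumption: lowercase, then split on whitespace only
  return text.lower().split()

def count_tokens(lines, tokenizer):
  # hypothetical helper: tally how often each token occurs
  counts = Counter()
  for line in lines:
    counts.update(tokenizer(line))
  return counts

sample_lines = ["The movie was good", "the plot was thin"]  # made-up data
counts = count_tokens(sample_lines, simple_tokenizer)

# keep only tokens seen at least min_freq times, most frequent first
min_freq = 2
kept = [tok for (tok, ct) in counts.most_common() if ct >= min_freq]
print(kept)
```

With this made-up data only “the” and “was” occur twice, so they are the only tokens that survive a min_freq of 2.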

Writing a custom Vocabulary object isn’t too difficult — I guess. I’ve done so several times and it’s easy for me now. But I remember it wasn’t so easy the first time I tried.

An interesting design choice is how to work with the internal stoi (“string to integer”) dictionary object. Because you have to deal with words/tokens that aren’t in the dictionary, for example “flooglebarge”, you want to return ID = 0 for unknown words. A common strategy is to write a __getitem__() method that calls the get() method of the stoi dictionary (all Python dictionaries have one) and passes 0 as the default argument, so that an unknown word returns 0 instead of raising a KeyError. Tricky.
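The unknown-word idea can be shown in isolation. The sketch below is a stripped-down illustration only: the class name TinyVocab and its three-word token list are hypothetical, and IDs start at 1 here simply so that 0 stays free for unknown words.

```python
class TinyVocab:
  # hypothetical minimal class, for illustration only
  def __init__(self, tokens):
    # IDs start at 1 so that 0 is reserved for unknown tokens
    self.stoi = {tok: i for (i, tok) in enumerate(tokens, start=1)}

  def __getitem__(self, tok):
    # dict.get() returns its second argument when the key is missing
    return self.stoi.get(tok, 0)

vocab = TinyVocab(["the", "and", "code"])
print(vocab["and"])           # 2
print(vocab["flooglebarge"])  # 0
```

Note that only the square-bracket access on the object itself is protected. Reading vocab.stoi["flooglebarge"] directly bypasses __getitem__() and raises a KeyError.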

Most Vocabulary objects also provide an itos (“integer to string”) lookup that accepts an ID and returns the corresponding word/token. It can be implemented as either an ordinary list (because lists are accessed by an integer index) or as a dictionary (to keep design symmetry with the stoi dictionary).
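The two itos designs behave the same for valid IDs, which a small side-by-side sketch makes concrete. The token list here is made up for illustration, and the names itos_list and itos_dict are hypothetical.

```python
tokens = ["(unk)", "(pad)", "the", "and"]  # made-up tokens, ID = position

itos_list = list(tokens)             # the list index is the ID
itos_dict = dict(enumerate(tokens))  # the dictionary key is the ID

print(itos_list[3])  # and
print(itos_dict[3])  # and

# The designs differ only on bad IDs:
#   itos_list[99] raises IndexError, itos_dict[99] raises KeyError
#   itos_list[-1] quietly returns the last token, itos_dict[-1] raises KeyError
```

The list version is a bit more compact, while the dictionary version rejects a negative ID instead of wrapping around, so which one is preferable depends on how defensive the surrounding code needs to be.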

All in all, it was an interesting exploration that solidified my knowledge of Vocabulary objects for NLP problems.

Four of the many different book covers for the James Bond spy novel “Casino Royale” by author Ian Fleming. Published in 1953, this was the first novel in the series. It was fairly well received in the U.K. but did poorly in the U.S. But in an interview in 1961, U.S. President J.F. Kennedy mentioned that “From Russia with Love” (1957, fifth in the series) was one of his favorite novels, and book sales exploded.

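A Vocabulary object built from a Bond novel’s text with the code below would assign a single ID to “Bond” (the spy), the verb “bond” (to combine things) and the noun “bond” (the financial instrument or word meaning a coupling), because the basic_english tokenizer lowercases the text and the stoi dictionary is keyed on the token string alone. Giving them separate IDs would require a case-preserving tokenizer plus some way to tag word sense.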

Code below. Long.

# custom_vocab_demo.py

import torchtext as tt

g_toker = tt.data.utils.get_tokenizer("basic_english")

class MyVocab:
  # primary functionality is stoi

  def __init__(self, fn, tokenizer, min_freq):
    self.itos = dict()  # this could be a list
    self.stoi = dict()

    # get words and their counts from source
    tmp = dict()  # key = word/token, value = count
    
    with open(fn, "r") as f:    # file is closed automatically
      for line in f:
        line = line.strip()       # remove NL
        txt = line.split(",")[1]  # assumes file structure
        tokens = tokenizer(txt)   # use the tokenizer passed in
        for tok in tokens:        # each word (or punc)
          if tok not in tmp:
            tmp[tok] = 1    # first time seen
          else:
            tmp[tok] += 1   # bump count
    
    # sort by frequency, high to low
    lst = []
    for key in tmp:  # key = word/token
      lst.append((tmp[key], key))  # count first so sort() orders by count
    lst.sort(reverse=True)
    # print(lst)

    # create the stoi and itos dictionaries
    self.itos[0] = '(unk)'
    self.itos[1] = '(pad)'
    idx = 2
    for (ct,tok) in lst:
      if ct >= min_freq:  # keep tokens seen at least min_freq times
        self.stoi[tok] = idx
        self.itos[idx] = tok
        idx += 1

  def __getitem__(self, tok):
    return self.stoi.get(tok, 0)  # 0 if tok is not in stoi
        
def main():
  print("\nBegin custom MyVocab demo ")
  
  print("\nCreating vocab object from reviews20.txt ")
  fn = ".\\Data\\reviews20.txt"
  my_vocab = MyVocab(fn, g_toker, min_freq=1)
  idx = my_vocab["this"]  # calls __getitem__()
  print("this = ", idx)
  idx = my_vocab["story"]
  print("story = ", idx)
  idx = my_vocab["foobar"]
  print("foobar = ", idx)

  print("\nFirst 5 tokens: ")
  for i in range(5):
    print(i, my_vocab.itos[i])  # uses itos dict

  print("\nEnd demo ")

if __name__ == "__main__":
  main()