I’ve been looking at natural language processing (NLP) problems on-and-off for quite some time. One weekend I decided to implement a custom Vocabulary class from scratch. A Vocabulary object accepts a word/token and returns a unique integer ID. For example,
vocab = make_vocab(src_text, min_freq)
idx = vocab["and"]  # calls stoi
print(idx)  # 7
idx = vocab.stoi["code"]  # stoi directly
wrd = vocab.itos[7]
print(wrd)  # 'and'
The TorchText library has a built-in Vocab object but it’s very complex (so that it can handle almost any scenario) and you must still write code to parse the source text and frequencies and feed them to the Vocab constructor.
Writing a custom Vocabulary object isn’t too difficult, I guess. I’ve done so several times and it’s easy for me now. But I remember it wasn’t so easy the first time I tried.
An interesting design choice is how to work with the internal stoi (“string to integer”) dictionary object. Because you have to deal with words/tokens that aren’t in the dictionary, for example “flooglebarge”, you want to return ID = 0 for unknown words. A common strategy for situations like this is to write a __getitem__() method that calls the get() method of the internal stoi dictionary (all Python dictionaries have a get() method) and passes 0 as the default argument, so that an unknown word returns 0 instead of raising a KeyError. Note that indexing the stoi dictionary directly, as in vocab.stoi["flooglebarge"], bypasses __getitem__() and still raises a KeyError. Tricky.
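The lookup strategy described above can be sketched in a few lines. This is only a minimal illustration, not the full class from the demo: the class name TinyVocab and its hard-coded word list are made up for the example, and IDs 0 and 1 are assumed to be reserved for unknown and padding tokens.

```python
# Minimal sketch: unknown tokens map to ID 0 via dict.get().
# TinyVocab and its word list are hypothetical, for illustration only.

class TinyVocab:
    def __init__(self, words):
        # IDs 0 and 1 are assumed reserved for (unk) and (pad)
        self.stoi = {w: i for i, w in enumerate(words, start=2)}

    def __getitem__(self, tok):
        # get() returns the default (0) when tok is not a key
        return self.stoi.get(tok, 0)

vocab = TinyVocab(["the", "and", "movie"])
print(vocab["and"])           # 3
print(vocab["flooglebarge"])  # 0
```

The square-bracket syntax vocab["and"] is routed through __getitem__(), so the caller never sees a KeyError for an unknown word.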
Most Vocabulary objects implement an itos (“integer to string”) lookup that accepts an ID and returns the corresponding word/token. That can be implemented as either an ordinary list (because lists are accessed by an integer index) or as a dictionary (to keep design symmetry with the stoi dictionary).
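Here is a rough side-by-side sketch of the two itos designs. The token list is made up for the example, and the variable names itos_list and itos_dict are hypothetical.

```python
# Sketch: two ways to store itos. The token list is made up.
tokens = ["(unk)", "(pad)", "the", "and", "movie"]

itos_list = list(tokens)             # the list position is the ID
itos_dict = dict(enumerate(tokens))  # ID -> token, mirrors stoi
stoi = {tok: idx for idx, tok in enumerate(tokens)}

print(itos_list[3])                # and
print(itos_dict[3])                # and
print(itos_dict.get(99, "(unk)"))  # (unk)
```

One practical difference: an out-of-range ID raises IndexError with the list version (and a negative ID silently wraps around), while the dictionary version can use get() with a default, just like stoi.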
All in all, it was an interesting exploration that solidified my knowledge of Vocabulary objects for NLP problems.
Four of the many different book covers for the James Bond spy novel “Casino Royale” by author Ian Fleming. Published in 1953, this was the first novel in the series. It was fairly well received in the U.K. but did poorly in the U.S. But in an interview in 1961, U.S. President J.F. Kennedy mentioned that “From Russia with Love” (1957, fifth in the series) was one of his favorite novels, and book sales exploded.
A Vocabulary object based on a Bond novel’s text would likely have separate IDs for “Bond” (the spy) and the verb “bond” (to combine things) and the noun “bond” (the financial instrument or word meaning a coupling).
Code below. Long.
# custom_vocab_demo.py

import torchtext as tt

g_toker = tt.data.utils.get_tokenizer("basic_english")

class MyVocab:
    # primary functionality is stoi
    def __init__(self, fn, tokenizer, min_freq):
        self.itos = dict()  # this could be a list
        self.stoi = dict()

        # get words and their counts from source
        tmp = dict()  # key = word/token, value = count
        with open(fn, "r") as f:
            for line in f:
                line = line.strip()  # remove NL
                # assumes file structure: review text is the first
                # comma-separated field (adjust the index for your file)
                txt = line.split(",")[0]
                tokens = tokenizer(txt)  # lowered, split
                for tok in tokens:  # each word (or punc)
                    if tok not in tmp:
                        tmp[tok] = 1  # first time seen
                    else:
                        tmp[tok] += 1  # bump count

        # sort by frequency, high to low
        lst = []
        for key in tmp:  # key = word/token
            lst.append((tmp[key], key))  # count first is simpler
        lst.sort(reverse=True)
        # print(lst)

        # create the stoi and itos dictionaries
        self.itos[0] = '(unk)'
        self.itos[1] = '(pad)'
        idx = 2
        for (ct, tok) in lst:
            if ct >= min_freq:
                self.stoi[tok] = idx
                self.itos[idx] = tok
                idx += 1

    def __getitem__(self, tok):
        return self.stoi.get(tok, 0)  # 0 if tok is not a key

def main():
    print("\nBegin custom MyVocab demo ")

    print("\nCreating vocab object from reviews20.txt ")
    fn = ".\\Data\\reviews20.txt"
    my_vocab = MyVocab(fn, g_toker, min_freq=1)

    idx = my_vocab["this"]  # calls __getitem__()
    print("this = ", idx)

    idx = my_vocab["story"]
    print("story = ", idx)

    idx = my_vocab["foobar"]
    print("foobar = ", idx)

    print("\nFirst 5 tokens: ")
    for i in range(5):
        print(i, my_vocab.itos[i])  # uses itos dict

    print("\nEnd demo ")

if __name__ == "__main__":
    main()