Tokenizing Text Using the Basic English Algorithm

In natural language processing (NLP) problems you must tokenize the source text. This means you must split the text into words/tokens, usually convert to lower case, replace some punctuation, and so on. In Python, the spaCy library and the NLTK (natural language toolkit) library have many NLP functions, including tokenizers. The TorchText library has tokenizers too.

In some NLP problem scenarios you need a custom tokenizer to deal with oddities of your source text. One rainy weekend day I sat down on my couch and refactored the TorchText “basic_english” tokenizer source code. The algorithm is:

1. Convert source text to lower case.
2. Add a space before and after single-quote, period, comma, left paren, right paren, exclamation point, question mark.
3. Replace colon, semicolon, and <br /> with a space.
4. Remove double-quote characters.
5. Split on whitespace.
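The five steps can be sketched with plain Python string methods. This is a minimal sketch of the algorithm as listed, not the actual TorchText implementation (which uses regular expressions), and the function name is mine:

```python
def basic_english_tokenize(text):
  # step 1: convert to lower case
  text = text.lower()
  # step 2: pad selected punctuation with spaces
  for ch in ["'", ".", ",", "(", ")", "!", "?"]:
    text = text.replace(ch, " " + ch + " ")
  # step 3: replace colon, semicolon, <br /> with a space
  for s in [":", ";", "<br />"]:
    text = text.replace(s, " ")
  # step 4: remove double-quote
  text = text.replace('"', "")
  # step 5: split on whitespace
  return text.split()
```

For example, basic_english_tokenize('You can\'t "do" that!') returns ['you', 'can', "'", 't', 'do', 'that', '!'], with the apostrophe split out as its own token, as the five steps require.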

Somewhat unfortunately, the TorchText source code uses a Regular Expression approach — ugh. I am not a fan of regular expressions. When they work you’re fine but trying to debug a regular expression is a trip to the seventh level of Hell.

All of the TorchText tokenizer regular expression functionality is string replacement, so it would be an easy task to yank out the regular expressions and refactor using simple string replacement. This would make it easy to deal with situations where you want to retain the capitalization of certain words that have important meaning in your source text. For example, if you were working with calendar date text, you might want to retain the capitalization of the months January, February, March, and so on, because "march" can be a verb, "may" can be a modal verb, and so on.
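A sketch of one way to retain capitalization, assuming a caller-supplied keep_words list (the function name, parameter name, and placeholder scheme here are mine):

```python
def tokenize_keep_case(text, keep_words):
  # keep_words: words whose capitalization should survive, e.g. ["May", "March"]
  # swap each protected word for a placeholder, lower-case, then swap back
  protected = {}
  for i, w in enumerate(keep_words):
    tag = "xqzkeep" + str(i) + "xqz"  # unlikely to appear in real text
    protected[tag] = w
    text = text.replace(w, tag)
  text = text.lower()  # placeholders are already lower case, so they survive
  for tag, w in protected.items():
    text = text.replace(tag, w)
  # pad punctuation as before (partial list, for brevity)
  for ch in [".", ",", "!", "?"]:
    text = text.replace(ch, " " + ch + " ")
  return text.split()
```

For example, tokenize_keep_case("They may march in May and March.", ["May", "March"]) returns ['they', 'may', 'march', 'in', 'May', 'and', 'March', '.']. Note that a plain replace() can also match inside longer words ("Maybe"), so a production version would protect whole words only.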

Using string replacement is slower than regular expressions but easier to customize. The code could look like:

class MyTokenizer:
  # use simple string replacement instead of RE
  def tokenize(self, line):
    line = line.lower()              # step 1: lower case
    line = line.replace(".", " . ")  # step 2: pad punctuation (also ' , ( ) ! ?)
    line = line.replace(":", " ")    # step 3: replace with space (also ; and <br />)
    line = line.replace('"', "")     # step 4: remove double-quote
    # etc.
    return line.split()              # step 5: split on whitespace

my_toker = MyTokenizer()
line = "Blah, blah, whatever."
result = my_toker.tokenize(line)
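Tracing this partial version shows why the remaining replacements matter: with only period, colon, and double-quote handled, the commas stay attached to their words. (The snippet is restated here so it runs on its own.)

```python
class MyTokenizer:
  # partial version: handles period, colon, double-quote only
  def tokenize(self, line):
    line = line.lower()
    line = line.replace(".", " . ")
    line = line.replace(":", " ")
    line = line.replace('"', "")
    return line.split()

my_toker = MyTokenizer()
result = my_toker.tokenize("Blah, blah, whatever.")
# result: ['blah,', 'blah,', 'whatever', '.']  (commas not split out)
```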

Whenever I write code, I try to avoid external dependencies. Writing a custom text tokenizer is one way to eliminate an external dependency for NLP problems.

The term art tokenization can mean several things. One meaning is to create a set of blockchain tokens that establish ownership of an expensive piece of art, like a Van Gogh or a da Vinci. This allows partial ownership of a piece of art, and allows art to be traded in the way that company shares are traded on the stock market. The weird idea is that the physical presence of a work of art isn't too important when digital images of it can be shown.

Here are three examples of science-inspired art. Will any of them ever be worth millions of dollars? Who knows — maybe. Left: by artist Fabian Oefner. Center: by Igor Siwanowicz. Right: by an anonymous middle school student, posted by his teacher, Jim Dodson.

This entry was posted in Machine Learning.
