How to Create a Transformer Architecture Model for Natural Language Processing in Visual Studio Magazine.

I wrote an article titled “How to Create a Transformer Architecture Model for Natural Language Processing” in the November 2021 edition of the online Microsoft Visual Studio Magazine. See

My article explains how to create a transformer architecture model for natural language processing. Specifically, the article shows how to create a model that accepts a sequence of words such as “The man ran through the {blank} door” and then predicts most-likely words to fill in the blank.

Transformer architecture (TA) models such as BERT (bidirectional encoder representations from transformers) and GPT (generative pretrained transformer) have revolutionized natural language processing (NLP). But TA systems are extremely complex, and implementing one from scratch can take hundreds or even thousands of work-hours. The Hugging Face (HF) library is an open source collection of pretrained TA models together with an API set for working with them. The HF library makes implementing NLP systems using TA models much less difficult.

The demo program begins by loading a pretrained DistilBERT language model into memory. DistilBERT is a condensed version of the huge BERT language model. The source sentence is passed to a Tokenizer object which breaks the sentence into words/tokens and assigns an integer ID to each token. For example, one of the tokens is “man” and its ID is 1299, and the token that represents the blank-word is [MASK] and its ID is 103.
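To make the tokenization step concrete, here is a toy sketch of the idea. This is not the real WordPiece tokenizer that DistilBERT uses; the tiny vocabulary below is hypothetical except for the two IDs mentioned above ("man" = 1299 and [MASK] = 103).

```python
# Toy sketch of tokenization: map each word/token to an integer ID.
# The real DistilBERT tokenizer uses a WordPiece vocabulary of 28,996
# entries; this tiny dictionary is hypothetical except for the two IDs
# mentioned in the article ("man" = 1299, [MASK] = 103).
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102, "[MASK]": 103,
    "The": 1109, "man": 1299, "ran": 1868,
    "through": 1194, "the": 1103, "door": 1442,
}

def toy_tokenize(sentence):
    """Split on whitespace, add the special start/end tokens,
    and look up each token's integer ID."""
    tokens = ["[CLS]"] + sentence.split() + ["[SEP]"]
    return [toy_vocab[t] for t in tokens]

ids = toy_tokenize("The man ran through the [MASK] door")
print(ids)  # the [MASK] position carries ID 103
```

In the real library, a Tokenizer object also handles sub-word splitting and padding, but the core idea is the same: the model never sees words, only integer IDs.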

The token IDs are passed to the DistilBERT model and the model computes the likelihoods of 28,996 possible words/tokens to fill in the blank. The top five candidates to fill in the blank for “The man ran through the {blank} door” are: “front,” “bathroom,” “kitchen,” “back” and “garage.”
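Under the hood, the model's output for the masked position is a vector of raw scores (logits), one per vocabulary entry, and the top candidates are the tokens with the highest scores after a softmax. A minimal sketch of that selection step, using a hypothetical five-word vocabulary and made-up logit values:

```python
import math

# Hypothetical logits for the masked position over a tiny mock vocabulary.
# A real DistilBERT model produces one logit for each of its 28,996 tokens.
logits = {"front": 9.1, "back": 7.8, "bathroom": 8.4, "kitchen": 8.0, "red": 2.5}

def softmax_top_k(scores, k):
    """Convert raw scores to probabilities and return the k most likely tokens."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - m) for t, s in scores.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

for token, prob in softmax_top_k(logits, 3):
    print(f"{token}: {prob:.3f}")
```

With the made-up scores above, "front" comes out on top, mirroring the article's result for the demo sentence.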

One way to think about the fill-in-the-blank example presented in this article is that the DistilBERT model gives you an English language expert. You can ask this expert things such as what the missing word in a sentence is, or how similar two words are. But the DistilBERT expert doesn't have specific knowledge about anything beyond general English. For example, the basic DistilBERT model doesn't know anything about movies. It is possible to start with a basic DistilBERT model and then fine-tune it on movie review data to create a movie review expert. The fine-tuned model will know about English and also about the difference between a good movie review and a bad one.

Artificial intelligence has come a long way, but it will be quite some time until AI can understand photos like these ones. Left: This criminal has his hands full of trouble. Center: This criminal is on the espresso lane to jail. Right: Oopsie loompa.

This entry was posted in Machine Learning. Bookmark the permalink.
