How to Fine-Tune a Transformer Architecture NLP Model in Visual Studio Magazine

I wrote an article titled “How to Fine-Tune a Transformer Architecture NLP Model” in the November 2021 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/11/16/fine-tune-nlp-model.aspx.

My article describes how to fine-tune a pretrained Transformer Architecture model for natural language processing. Specifically, the article explains how to fine-tune a condensed version of a pretrained BERT model to create a binary classifier for a subset of the IMDB movie review dataset. The goal is sentiment analysis: accept the text of a movie review (such as "This movie was a great waste of my time") and output class 0 (negative review) or class 1 (positive review).

You can think of a pretrained transformer architecture (TA) model as sort of an English language expert. But the TA expert doesn’t know anything about movies and so you provide additional training to fine-tune the model so that it understands the difference between a positive movie review and a negative review.

I present a demo program. It begins by loading a small 200-item subset of the IMDB movie review dataset into memory. The full IMDB dataset has 50,000 movie reviews: 25,000 for training and 25,000 for testing, where each set has 12,500 positive and 12,500 negative reviews. Working with the full dataset is very time-consuming, so the demo uses just the first 100 positive training reviews and the first 100 negative training reviews.
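The article's exact data-loading code isn't reproduced here, but a minimal sketch, assuming the reviews sit in the standard aclImdb folder layout (one plain-text file per review), might look like this. The folder path and helper function name are illustrative:

import os

def load_reviews(root, label_dir, max_files=100):
    # read the first max_files reviews from aclImdb/train/<label_dir>
    texts = []
    dir_path = os.path.join(root, "train", label_dir)
    for fname in sorted(os.listdir(dir_path))[:max_files]:
        with open(os.path.join(dir_path, fname), encoding="utf-8") as f:
            texts.append(f.read())
    return texts

pos_texts = load_reviews("./aclImdb", "pos")  # first 100 positive training reviews
neg_texts = load_reviews("./aclImdb", "neg")  # first 100 negative training reviews
texts = pos_texts + neg_texts
labels = [1] * len(pos_texts) + [0] * len(neg_texts)  # 1 = positive, 0 = negative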

The reviews are read into memory and then converted to a data structure that holds integer token IDs; for example, the word "movie" has token ID = 3185. The tokenized reviews are wrapped in a PyTorch Dataset object, which serves batches of tokenized reviews and their associated labels to the training code.
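Continuing the sketch above, the tokenization and Dataset wrapper could look like the following, using the Hugging Face transformers library. The class and variable names are illustrative, not necessarily the article's exact code:

import torch
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
encodings = tokenizer(texts, truncation=True, padding=True)  # token IDs + attention masks

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        # each item is a dict of tensors: input_ids, attention_mask, labels
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

train_ds = IMDbDataset(encodings, labels)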

After the movie review data has been prepared, the demo loads a pretrained DistilBERT model into memory. DistilBERT is a condensed ("distilled"), but still large, version of the huge BERT model. The uncased version of DistilBERT has 66 million weights and biases. The demo then fine-tunes the pretrained model using standard PyTorch training techniques, and concludes by saving the fine-tuned model to file.
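A minimal sketch of the model-loading, fine-tuning and saving steps, continuing the code above. The batch size, learning rate, epoch count and output file name are assumptions for illustration, not necessarily the article's values:

import torch
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased")  # ~66 million parameters, 2 output classes by default
model.to(device)
model.train()

train_loader = DataLoader(train_ds, batch_size=10, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

for epoch in range(4):  # epoch count is illustrative
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        lbls = batch["labels"].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=lbls)
        outputs.loss.backward()  # cross-entropy loss computed by the model
        optimizer.step()

torch.save(model.state_dict(), "imdb_distilbert_model.pt")  # output file name assumed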

Working with natural language processing (NLP) is very challenging.



Writing a novel is arguably the highest form of NLP. Here are three books I read over and over. Left: “A Princess of Mars” (1912) by Edgar Rice Burroughs. Center: “Starship Troopers” (1959) by Robert A. Heinlein. Right: “Tom Swift and His Diving Seacopter” (1956) by “Victor Appleton II” (ghost writer James Lawrence).

