How To Compute Transformer Architecture Model Accuracy in Visual Studio Magazine

I wrote an article titled “How To Compute Transformer Architecture Model Accuracy” in the December 2021 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/12/07/compute-ta-model-accuracy.aspx.

My article explains how to compute the accuracy of a trained PyTorch Transformer Architecture model for natural language processing. Specifically, the article describes how to compute the classification accuracy of a condensed BERT model that predicts the sentiment (positive or negative) of movie reviews taken from the IMDB movie review dataset.

You can think of a pretrained TA model as sort of an English language expert that knows about things such as sentence structure and synonyms. But the TA expert doesn’t know anything about movies and so you provide additional training to fine-tune the model so that it understands the difference between a positive movie review and a negative review. I explained how to fine-tune and save a binary classification model in a previous article.

After you train a TA model, you need to write program-defined code that computes the classification accuracy of the model. This is non-trivial.

def accuracy(model, ds, toker, num_reviews):
  # item-by-item: good for debugging but slow
  n_correct = 0; n_wrong = 0
  loader = DataLoader(ds, batch_size=1, shuffle=False)
  for (b_ix, batch) in enumerate(loader):
    print("==========================================")
    print(str(b_ix) + "  ", end="")
    input_ids = batch['input_ids'].to(device)  # just IDs

    # tensor([[101, 1045, 2253, . . 0, 0]])
    # words = toker.decode(input_ids[0])
    # [CLS] i went and saw . . [PAD] [PAD]

    lbl = batch['labels'].to(device)  # target 0 or 1
    mask = batch['attention_mask'].to(device)
    with T.no_grad():
      outputs = model(input_ids, \
        attention_mask=mask, labels=lbl)

    # SequenceClassifierOutput(
    #  loss=tensor(0.0168),
    #  logits=tensor([[-2.2251, 1.8527]]),
    #  hidden_states=None,
    #  attentions=None)
    logits = outputs[1]  # a tensor
    pred_class = T.argmax(logits)
    print("  target: " + str(lbl.item()), end="")
    print("  predicted: " + str(pred_class.item()), end="")
    if lbl.item() == pred_class.item():
      n_correct += 1; print(" | correct")
    else:
      n_wrong += 1; print(" | wrong")

    if b_ix == num_reviews - 1:
      break

    if lbl.item() != pred_class.item():
      print("Test review as token IDs: ")
      T.set_printoptions(threshold=100, edgeitems=3)
      print(input_ids)
      print("Review source: ")
      words = toker.decode(input_ids[0])  # giant string
      print_list(words.split(' '), 3, 3)

  print("==========================================")

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  print("\nCorrect: %4d " % n_correct)
  print("Wrong:   %4d " % n_wrong)
  return acc

Until recently, creating a natural language processing systems such as movie sentiment analysis was a major undertaking. You’d have to start from scratch and then train a model on a huge corpus of text. The process typically took several months. The availability of pretrained models from Hugging Face and other sources greatly simplifies creating NLP systems. However, NLP systems still require significant time and effort.

Determining transformer architecture model accuracy is difficult. Determining bowling accuracy is not so difficult. Three screenshots from YouTube where bowlers got their fingers stuck in their bowling ball.