Computing Accuracy of a Hugging Face Fine-Tuned Binary Classification Model

I’ve been slowly but surely walking through Hugging Face (HF) documentation examples. HF is an open-source code library for transformer architecture (TA) systems for natural language processing (NLP).

In a recent exploration, I refactored a documentation example that tackled the IMDB movie review sentiment analysis problem. The ultimate goal is to create a model that accepts a movie review in raw text (“The movie was a great waste of my time”) and outputs 0 = negative sentiment or 1 = positive sentiment.

The documentation example was excellent but it stopped after the model was created. To solidify my understanding I decided to write a second program that computes the classification accuracy of the model. Put another way, I wanted to investigate how to use a fine-tuned HF binary classification model (because computing accuracy involves sending input and capturing output).
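A minimal sketch of that send-input, capture-output step for a single review might look like the following. The helper name predict_sentiment() is hypothetical, and the sketch assumes a Hugging Face tokenizer and a fine-tuned sequence classification model have already been loaded elsewhere.

```python
import torch

def predict_sentiment(text, tokenizer, model):
  # hypothetical helper: tokenizer and model are assumed to be
  # loaded already (for example, a DistilBERT tokenizer and a
  # fine-tuned DistilBertForSequenceClassification)
  enc = tokenizer(text, truncation=True, return_tensors="pt")
  model.eval()  # evaluation mode: no dropout
  with torch.no_grad():  # no gradients needed for prediction
    outputs = model(enc["input_ids"],
      attention_mask=enc["attention_mask"])
  # logits has shape [1, 2]; the index of the larger value
  # is the predicted class: 0 = negative, 1 = positive
  return int(torch.argmax(outputs.logits, dim=1).item())
```

Reading the logits by name (outputs.logits) rather than by position avoids depending on whether labels were passed to the model, because the loss is placed first in the output only when labels are supplied.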

I implemented an accuracy() function without too much difficulty, although, as usual with very complex software, there were several hurdles along the way. I did notice that I used many tricks I’ve learned over the past several years. If I didn’t have all that background knowledge, implementing an accuracy() function would almost certainly have taken me several days instead of several hours.

My model used only the first 100 positive training reviews and the first 100 negative training reviews. The full IMDB dataset has 50,000 reviews (25,000 for training and 25,000 for testing), and I wanted my program runs to complete in minutes rather than hours or days.

I wrote two accuracy() functions. The first used an item-by-item approach:

loop each data item (review)
  get review text
  get review label (0 or 1)
  send review to model, get output (logits)
  if predicted class (index of largest logit) matches label
    num_correct += 1
  else
    num_wrong += 1
end-loop
return num_correct / (num_correct + num_wrong)
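
The pseudocode above can be sketched in plain Python, without any model, to show how a pair of logits becomes a predicted class. The logits and labels here are made-up values for four hypothetical reviews, and the function names are illustrative only.

```python
def predicted_class(logits):
  # index of the largest logit: 0 = negative, 1 = positive
  return max(range(len(logits)), key=lambda i: logits[i])

def accuracy_item_by_item(all_logits, all_labels):
  n_correct = 0; n_wrong = 0
  for logits, label in zip(all_logits, all_labels):
    if predicted_class(logits) == label:
      n_correct += 1
    else:
      n_wrong += 1
  return n_correct / (n_correct + n_wrong)

# made-up model outputs and labels for four hypothetical reviews
demo_logits = [[-2.2, 1.9], [1.5, -1.1], [0.3, 0.4], [0.9, -0.2]]
demo_labels = [1, 0, 0, 1]
demo_acc = accuracy_item_by_item(demo_logits, demo_labels)
print("Accuracy = %0.4f " % demo_acc)
```

In the real program the logits come from the model, one review at a time, which is where almost all of the running time goes.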

The devil was in the details. This accuracy() function was very slow because TA models are nutso-complicated with millions of weights and biases. So, I wrote a second accuracy() function that uses a set-approach, computing all outputs at once. This approach was significantly trickier to write. It is marginally faster than the item-by-item approach.
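
The core of the set approach is a handful of tensor operations applied to all logits at once. Here is a small sketch using made-up logits for four hypothetical reviews in place of real model output:

```python
import torch

# made-up logits for four hypothetical reviews, shape [4, 2]
logits = torch.tensor([[-2.2, 1.9], [1.5, -1.1],
  [0.3, 0.4], [0.9, -0.2]])
labels = torch.tensor([1, 0, 0, 1])  # actual classes, shape [4]

# predicted class per row = index of the larger logit
pred_y = torch.argmax(logits, dim=1)

# element-wise comparison gives booleans; sum counts the matches
num_correct = torch.sum(pred_y == labels).item()
acc = num_correct / len(labels)
print("Correct = %d  Accuracy = %0.4f " % (num_correct, acc))
```

The dim=1 argument matters: without it, torch.argmax() returns a single index into the flattened tensor instead of one prediction per review.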

The IMDB model gave 99% accuracy (198 out of 200 correct) on the training data. I used the training data rather than the test data because I only used 200 items to train the model, and only used 3 training epochs. (On the test data, the model scored 71% accuracy).

Interesting stuff.

The early days of science fiction featured all kinds of models of robots. Here are three magazine covers where I’d describe the robot models as inaccurate. But interesting.

Demo code:

# imdb_eval.py
# tuned HF model for IMDB sentiment analysis accuracy
# zipped raw data at:
# https://ai.stanford.edu/~amaas/data/sentiment/

import numpy as np  # used only for np.random.seed()
from pathlib import Path
from transformers import DistilBertTokenizerFast
import torch
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification
from transformers import logging  # to suppress warnings

device = torch.device('cpu')

class IMDbDataset(torch.utils.data.Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):
    item = {key: torch.tensor(val[idx]) for key, val \
      in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    return len(self.labels)

def read_imdb_split(split_dir):
  split_dir = Path(split_dir)
  texts = []
  labels = []
  for label_dir in ["pos", "neg"]:
    for text_file in (split_dir/label_dir).iterdir():
      texts.append(text_file.read_text(encoding='utf-8'))
      labels.append(0 if label_dir == "neg" else 1)
  return texts, labels

def accuracy_slow(model, ds):
  # item-by-item: good for debugging but very slow
  n_correct = 0; n_wrong = 0
  loader = DataLoader(ds, batch_size=1, shuffle=False)
  for (b_ix, batch) in enumerate(loader):
    input_ids = batch['input_ids'].to(device)  # token IDs
    lbl = batch['labels'].to(device)  # actual 0 or 1
    # for id in input_ids[0]:
    #   word = tokenizer.decode(id)
    #   print(id, word)
    #   input()
    attention_mask = batch['attention_mask'].to(device)
    with torch.no_grad():
      outputs = model(input_ids, \
        attention_mask=attention_mask, labels=lbl)
    # SequenceClassifierOutput(
    #  loss=tensor(0.0168),
    #  logits=tensor([[-2.2251,  1.8527]]),
    #  hidden_states=None,
    #  attentions=None)
    logits = outputs.logits  # a tensor, shape [1, 2]
    pred_class = torch.argmax(logits)
    if lbl.item() == pred_class.item():
      n_correct += 1
    else:
      n_wrong += 1

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  print("\nCorrect: %4d " % n_correct)
  print("Wrong:   %4d " % n_wrong)
  return acc

def accuracy_fast(model, ds):
  # all items at once: slightly faster but less clear
  loader = DataLoader(ds, batch_size=len(ds), shuffle=False)
  for (b_ix, batch) in enumerate(loader):  # one giant batch
    input_ids = batch['input_ids'].to(device)  # Size([200, 512])
    lbls = batch['labels'].to(device)  # all labels Size([200])

    attention_mask = batch['attention_mask'].to(device)
    with torch.no_grad():
      outputs = model(input_ids, \
        attention_mask=attention_mask, labels=lbls)
    logits = outputs.logits  # all logits Size([200, 2])
    pred_y = torch.argmax(logits, dim=1)  # 0s or 1s Size([200])

    num_correct = torch.sum(lbls==pred_y)
    print("\nCorrect: ")
    print(num_correct.item())
    acc = (num_correct.item() * 1.0 / len(ds))
    return acc

def main():
  # 0. get ready
  print("\nBegin evaluation of IMDB HF model ")
  logging.set_verbosity_error()  # suppress wordy warnings
  torch.manual_seed(1)
  np.random.seed(1)

  # 1. load pretrained model
  print("\nLoading untuned model ")
  model = \
    DistilBertForSequenceClassification.from_pretrained( \
    'distilbert-base-uncased')
  model.to(device)
  print("Done ")

  # 2. load tuned model wts and biases
  print("\nLoading tuned model wts and biases ")
  model.load_state_dict(torch.load(".\\Models\\imdb_state.pt"))
  model.eval()
  print("Done ")

  # 3. load training data used to create tuned model
  print("\nLoading training data from file into memory ")
  train_texts, train_labels = \
    read_imdb_split(".\\DataSmall\\aclImdb\\train")
  print("Done ")

  # 4. tokenize the raw text data
  print("\nTokenizing training text data ")
  tokenizer = \
    DistilBertTokenizerFast.from_pretrained(\
    'distilbert-base-uncased')
  train_encodings = \
    tokenizer(train_texts, truncation=True, padding=True)
  print("Done ")

  # 5. put tokenized text into PyTorch Dataset
  print("\nConverting tokenized text into PyTorch Datasets ")
  train_dataset = IMDbDataset(train_encodings, train_labels)
  print("Done ")

  # 6. compute classification accuracy
  print("\nComputing model accuracy (slow) on train data ")
  acc = accuracy_slow(model, train_dataset)
  print("Accuracy = %0.4f " % acc)

  print("\nComputing model accuracy (fast) on train data ")
  acc = accuracy_fast(model, train_dataset)
  print("Accuracy = %0.4f " % acc)

  print("\nEnd demo ")

if __name__ == "__main__":
  main()