## Computing Accuracy of a Hugging Face Fine-Tuned Binary Classification Model

I’ve been slowly but surely walking through Hugging Face (HF) documentation examples. HF is an open-source code library for transformer architecture (TA) systems for natural language processing (NLP).

In a recent exploration, I refactored a documentation example that tackled the IMDB movie review sentiment analysis problem. The ultimate goal is to create a model that accepts a movie review in raw text (“The movie was a great waste of my time”) and outputs 0 = negative sentiment or 1 = positive sentiment.

The documentation example was excellent but it stopped after the model was created. To solidify my understanding I decided to write a second program that computes the classification accuracy of the model. Put another way, I wanted to investigate how to use a fine-tuned HF binary classification model (because computing accuracy involves sending input and capturing output).

I implemented an accuracy() function without too much difficulty. As usual with very complex software, there were several hurdles along the way. I did notice that I used many tricks I’ve learned over the past several years — if I didn’t have all that background knowledge, implementing an accuracy() function would almost certainly have taken me several days instead of several hours.

My model used only the first 100 positive training reviews and the first 100 negative training reviews, because the full IMDB training set has 12,500 reviews of each class and I wanted my program runs to complete in minutes rather than hours or days.
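Assuming the reviews are loaded as parallel lists of texts and 0/1 labels (all positives first, then all negatives), the subsetting can be sketched like this. The lists here are made-up stand-ins for the real IMDB data:

```python
# sketch: keep only the first 100 reviews of each class
# texts/labels are stand-ins for the full IMDB training lists
texts  = ["pos review %d" % i for i in range(12500)] + \
         ["neg review %d" % i for i in range(12500)]
labels = [1] * 12500 + [0] * 12500

small_texts  = texts[:100]  + texts[12500:12600]
small_labels = labels[:100] + labels[12500:12600]

print(len(small_texts))   # 200 items total
print(sum(small_labels))  # 100 positive labels
```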

I wrote two accuracy() functions. The first used an item-by-item approach:

```
loop each data item (review)
  get review text
  get review label (0 or 1)
  send review to model, get output (logits)
  if argmax(logits) matches the label
    num_correct += 1
  else
    num_wrong += 1
end-loop
return num_correct / (num_correct + num_wrong)
```
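As a minimal, self-contained sketch of that loop, here is a pure-Python version where a hypothetical score() function stands in for the real model and returns a pair of logits:

```python
def score(text):
  # hypothetical stand-in for the model: returns [neg_logit, pos_logit]
  return [1.0, 3.0] if "great" in text else [2.0, -1.0]

def accuracy(reviews):
  num_correct = 0; num_wrong = 0
  for (text, label) in reviews:
    logits = score(text)
    pred = 0 if logits[0] > logits[1] else 1  # argmax of two logits
    if pred == label:
      num_correct += 1
    else:
      num_wrong += 1
  return num_correct / (num_correct + num_wrong)

reviews = [("a great film", 1), ("truly great acting", 1),
           ("boring and slow", 0), ("a great waste of time", 0)]
print(accuracy(reviews))  # 0.75 -- the sarcastic review fools the stub
```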

The devil was in the details. This accuracy() function was very slow because TA models are nutso-complicated with millions of weights and biases. So, I wrote a second accuracy() function that uses a set approach, computing all the outputs at once. This approach was significantly trickier to write, and in the end it is only marginally faster than the item-by-item approach.
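The set approach boils down to one argmax over a whole logits matrix instead of a Python loop. A tiny NumPy sketch, with made-up logits standing in for the model's batched output:

```python
import numpy as np

# made-up logits for 4 reviews: column 0 = negative, column 1 = positive
logits = np.array([[-2.2,  1.8],
                   [ 0.5, -0.3],
                   [-1.0,  2.0],
                   [ 1.5,  0.2]])
labels = np.array([1, 0, 1, 1])  # actual classes

preds = np.argmax(logits, axis=1)      # one argmax per row: [1 0 1 0]
num_correct = np.sum(preds == labels)  # 3 of 4 match
acc = num_correct / len(labels)
print(acc)  # 0.75
```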

The IMDB model gave 99% accuracy (198 out of 200 correct) on the training data. I used the training data rather than the test data because I only used 200 items to train the model, and only used 3 training epochs. (On the test data, the model scored 71% accuracy.)

Interesting stuff.

The early days of science fiction featured all kinds of models of robots. Here are three magazine covers where I’d describe the robot models as inaccurate. But interesting.

Demo code:

```
# imdb_eval.py
# compute accuracy of a tuned HF model for IMDB sentiment analysis
# zipped raw data at:
# https://ai.stanford.edu/~amaas/data/sentiment/

import numpy as np
from pathlib import Path
import torch
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification
from transformers import logging  # to suppress warnings

device = torch.device('cpu')

class IMDbDataset(torch.utils.data.Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):
    item = {key: torch.tensor(val[idx]) for key, val \
      in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    return len(self.labels)

def read_imdb_split(split_dir):
  # read all review texts and labels from pos and neg sub-dirs
  split_dir = Path(split_dir)
  texts = []; labels = []
  for label_dir in ["pos", "neg"]:
    for text_file in (split_dir/label_dir).iterdir():
      texts.append(text_file.read_text(encoding='utf-8'))
      labels.append(0 if label_dir == "neg" else 1)
  return texts, labels

def accuracy_slow(model, ds):
  # item-by-item: good for debugging but very slow
  n_correct = 0; n_wrong = 0
  loader = torch.utils.data.DataLoader(ds,
    batch_size=1, shuffle=False)
  for (b_ix, batch) in enumerate(loader):
    input_ids = batch['input_ids'].to(device)  # no masks
    lbl = batch['labels'].to(device)  # actual 0 or 1
    # for id in input_ids[0]:
    #   word = tokenizer.decode(id)
    #   print(id, word)
    #   input()
    with torch.no_grad():
      outputs = model(input_ids, labels=lbl)
    # SequenceClassifierOutput(
    #  loss=tensor(0.0168),
    #  logits=tensor([[-2.2251,  1.8527]]),
    #  hidden_states=None,
    #  attentions=None)
    logits = outputs[1]  # a tensor
    pred_class = torch.argmax(logits)
    if lbl.item() == pred_class.item():
      n_correct += 1
    else:
      n_wrong += 1

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  print("\nCorrect: %4d " % n_correct)
  print("Wrong:   %4d " % n_wrong)
  return acc

def accuracy_fast(model, ds):
  # all items at once: slightly faster but less clear
  loader = torch.utils.data.DataLoader(ds,
    batch_size=len(ds), shuffle=False)
  for (b_ix, batch) in enumerate(loader):  # one giant batch
    input_ids = batch['input_ids'].to(device)  # Size([200, 512])
    lbls = batch['labels'].to(device)  # all labels Size([200])

    with torch.no_grad():
      outputs = model(input_ids, labels=lbls)
    logits = outputs[1]  # all logits Size([200, 2])
    pred_y = torch.argmax(logits, dim=1)  # 0s or 1s Size([200])

    num_correct = torch.sum(lbls==pred_y)
    print("\nCorrect: ")
    print(num_correct.item())

  acc = (num_correct.item() * 1.0 / len(ds))
  return acc

def main():
  print("\nBegin evaluation of IMDB HF model ")
  logging.set_verbosity_error()  # suppress wordy warnings
  torch.manual_seed(1)
  np.random.seed(1)

  # 1. load pretrained DistilBERT model
  print("\nLoading untuned DistilBERT model ")
  model = \
    DistilBertForSequenceClassification.from_pretrained( \
    'distilbert-base-uncased')
  model.to(device)
  print("Done ")

  # 2. load tuned model wts and biases
  print("\nLoading tuned model wts and biases ")
  model.load_state_dict(
    torch.load("imdb_state.pt"))  # your saved weights file
  model.eval()
  print("Done ")

  # 3. load training data used to create tuned model
  print("\nLoading training data ")
  train_texts, train_labels = \
    read_imdb_split("aclImdb/train")  # your path to the data
  print("Done ")

  # 4. tokenize the raw text data
  print("\nTokenizing training text data ")
  tokenizer = \
    DistilBertTokenizerFast.from_pretrained(\
    'distilbert-base-uncased')
  train_encodings = \
    tokenizer(train_texts, truncation=True, padding=True)
  print("Done ")

  # 5. put tokenized text into PyTorch Dataset
  print("\nConverting tokenized text into Pytorch Datasets ")
  train_dataset = IMDbDataset(train_encodings, train_labels)
  print("Done ")

  # 6. compute classification accuracy
  print("\nComputing model accuracy (slow) on train data ")
  acc = accuracy_slow(model, train_dataset)
  print("Accuracy = %0.4f " % acc)

  print("\nComputing model accuracy (fast) on train data ")
  acc = accuracy_fast(model, train_dataset)
  print("Accuracy = %0.4f " % acc)

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
```