Positive and Unlabeled Learning (PUL) Using PyTorch

I wrote an article titled “Positive and Unlabeled Learning (PUL) Using PyTorch” in the May 2021 edition of the online Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/05/20/pul-pytorch.aspx.

A positive and unlabeled learning (PUL) problem occurs when a machine learning training dataset has only a few positively labeled items and many unlabeled items. For example, suppose you want to train a machine learning model to predict whether a hospital patient has a disease or not, based on predictor variables such as age, blood pressure, and so on. The training data might have a few dozen instances of items that are positive (class 1 = patient has disease) and many hundreds or thousands of data items that are unlabeled and so could be either class 1 = patient has disease, or class 0 = patient does not have disease.

The goal of PUL is to use the information contained in the dataset to guess the true labels of the unlabeled data items. After the class labels of some of the unlabeled items have been guessed, the resulting labeled dataset can be used to train a binary classification model using any standard machine learning technique, such as k-nearest neighbors classification, neural binary classification, logistic regression classification, naive Bayes classification, and so on.

PUL is challenging and there are several techniques to tackle such problems. I created a synthetic dataset of employee information where the goal is to predict whether an employee is an introvert (class 0) or an extrovert (class 1). There are 200 data (employee) items. Only 20 are labeled as positive = extrovert = 1. The other 180 data items are unlabeled and could be either positive, or negative = introvert = 0.

The demo program repeatedly (eight times) trains a helper binary classifier using the 20 positive employee data items and 20 randomly selected unlabeled items which are temporarily treated as negative. Expressed in pseudo-code:

create a 40-item train dataset with all 20 positive
  and 20 randomly selected unlabeled items that
  are temporarily treated as negative
    
loop several times
  train a binary classifier using the 40-item train data
  use trained model to score the 160 unused unlabeled
    data items
  accumulate the p-score for each unused unlabeled item
    
  generate a new train dataset with the 20 positive
    and 20 different unlabeled items treated as negative
end-loop
  

for-each of the 180 unlabeled items
  compute the average p-score

  if avg p-score > hi threshold
    guess its label as positive
  else
    insufficient evidence to make a guess
  end-if
end-for
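The pseudo-code above can be sketched in a few dozen lines of Python. In this sketch a plain logistic-regression helper (trained by batch gradient descent) stands in for the article's neural binary classifier, the data is synthetic random data rather than the Employee file, and the 0.90 hi threshold is an assumed value; only the sizes (20 positive, 180 unlabeled, 8 rounds) follow the demo.

```python
# Sketch of the PUL p-score averaging scheme from the pseudo-code above.
# The helper classifier, data, and threshold are stand-in assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_pos, n_unl, dim = 20, 180, 4
X_pos = rng.normal(1.0, 1.0, (n_pos, dim))   # the 20 known positives
X_unl = rng.normal(0.0, 1.0, (n_unl, dim))   # the 180 unlabeled items

def train_logreg(X, y, lr=0.1, epochs=200):
    # plain batch gradient descent on logistic loss -- a simple
    # stand-in for the neural binary classifier in the article
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

p_sums = np.zeros(n_unl)    # accumulated p-scores per unlabeled item
p_counts = np.zeros(n_unl)  # how many times each item was scored

for _ in range(8):  # eight rounds, as in the demo
    # 20 randomly selected unlabeled items temporarily treated as negative
    neg_idx = rng.choice(n_unl, 20, replace=False)
    X_train = np.vstack([X_pos, X_unl[neg_idx]])
    y_train = np.concatenate([np.ones(n_pos), np.zeros(20)])
    w, b = train_logreg(X_train, y_train)

    # score the 160 unused unlabeled items and accumulate p-scores
    unused = np.setdiff1d(np.arange(n_unl), neg_idx)
    scores = 1.0 / (1.0 + np.exp(-(X_unl[unused] @ w + b)))
    p_sums[unused] += scores
    p_counts[unused] += 1

avg_p = p_sums / np.maximum(p_counts, 1)  # guard items never scored
guess_pos = avg_p > 0.90                  # hi threshold (assumed value)
print("guessed positive:", int(guess_pos.sum()))
```

Items whose average p-score does not exceed the threshold are simply left unlabeled, matching the "insufficient evidence" branch above.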

Somewhat unexpectedly, the most difficult part of a PUL system is wrangling the data to generate dynamic (changing) training datasets. The challenge is to be able to create an initial training dataset with the 20 positive items and 20 randomly selected unlabeled items like so:

train_file = ".\\Data\\employee_pul_200.txt"
train_ds = EmployeeDataset(train_file, 20, 180)

And then inside a loop, be able to reinitialize the training dataset with the same 20 positive items but 20 different unlabeled items:

train_ds.reinit()
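One way to sketch such a dataset class is shown below. This is an assumption-laden reconstruction, not the article's actual code: it assumes a tab-delimited file whose first n_pos lines are the positive items and whose remaining lines are unlabeled, and it is written as a plain Python class (the real EmployeeDataset in the article derives from torch.utils.data.Dataset, but the reinit() idea is the same).

```python
# Sketch of a dynamic training dataset with a reinit() method.
# File layout (first 20 lines positive, rest unlabeled) is assumed.
import numpy as np

class EmployeeDataset:
    def __init__(self, fn, n_pos, n_unl, seed=1):
        all_xy = np.loadtxt(fn, delimiter="\t", dtype=np.float32)
        self.x_pos = all_xy[:n_pos]               # the 20 known positives
        self.x_unl = all_xy[n_pos:n_pos + n_unl]  # the 180 unlabeled items
        self.rng = np.random.default_rng(seed)
        self.reinit()

    def reinit(self):
        # pick 20 fresh unlabeled items to treat as temporary negatives;
        # the 20 positive items are always kept
        idx = self.rng.choice(len(self.x_unl), 20, replace=False)
        self.unl_idx = idx
        self.data = np.vstack([self.x_pos, self.x_unl[idx]])
        self.labels = np.concatenate([np.ones(20, dtype=np.float32),
                                      np.zeros(20, dtype=np.float32)])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i], self.labels[i]

# demo with a synthetic 200-line file (4 made-up predictor columns)
import tempfile, os
rng = np.random.default_rng(0)
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
np.savetxt(path, rng.normal(size=(200, 4)), delimiter="\t")

train_ds = EmployeeDataset(path, 20, 180)
train_ds.reinit()  # same 20 positives, 20 different pretend-negatives
```

Each call to reinit() leaves the dataset 40 items long, so the training loop never needs to construct a new Dataset object.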

Because the PUL guessing process is probabilistic, there are many approaches you can use. The technique presented in the article is based on a 2013 research paper by F. Mordelet and J.P. Vert, titled “A Bagging SVM to Learn from Positive and Unlabeled Examples”. That paper uses an SVM (support vector machine) binary classifier to analyze the unlabeled data. The article uses a neural binary classifier instead. The approach presented in the article is new and mostly unexplored.
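For comparison, the bagging-SVM idea from the paper can be sketched with scikit-learn's SVC in place of the neural helper: the same averaging loop as above, but each round's helper model is an SVM. The data, sizes, and kernel settings here are assumptions for illustration, not values from the paper.

```python
# Hedged sketch of the bagging-SVM variant (Mordelet & Vert):
# train an SVM on the positives plus a random unlabeled subsample
# treated as negative, then average probabilities over rounds.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, (20, 4))   # 20 known positives (synthetic)
X_unl = rng.normal(0.0, 1.0, (180, 4))  # 180 unlabeled (synthetic)

sums = np.zeros(180); counts = np.zeros(180)
for _ in range(8):
    neg = rng.choice(180, 20, replace=False)  # pretend-negatives
    X = np.vstack([X_pos, X_unl[neg]])
    y = np.concatenate([np.ones(20), np.zeros(20)])
    clf = SVC(probability=True, random_state=0).fit(X, y)

    rest = np.setdiff1d(np.arange(180), neg)  # the 160 unused items
    sums[rest] += clf.predict_proba(X_unl[rest])[:, 1]
    counts[rest] += 1

avg = sums / np.maximum(counts, 1)  # average p-score per unlabeled item
```

Swapping the SVM for a neural binary classifier, as the article does, leaves the surrounding bagging-and-averaging machinery unchanged.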


Here are three old devices and descriptions of what they did. All three devices are real, but only one description is true; I made up the other two descriptions and they are completely false. Can you guess which is the positive label/description? Answer below.

Left: This is a woman wearing a “brank”, a device husbands put on wives who talked too much. It had a bit similar to the one on a horse bridle.
Center: This is a device called a “Johnson passer” used in London to put a divider down the middle of a street. The person riding in the cab released small white pebbles through the tube in the back.
Right: This is a device used by fishermen called a “bobblet”. They would stick the bill into the water and blow, creating bubbles that would attract crabs and eels, which were then speared.


This entry was posted in Machine Learning, PyTorch.
