## Positive and Unlabeled Learning: How Complex is Too Complex?

One of my ongoing projects is to design an improved algorithm for PUL (positive and unlabeled learning). The problem scenario is that you have some data where the class label to predict is class 1 = positive, and other data that is unlabeled, meaning it could be either class 0 = negative or class 1 = positive. The goal is to analyze the unlabeled data and guess whether each unlabeled item is class 0 or class 1.

Medical data is often PUL — a few patients have a disease, but many thousands of patients are unlabeled.

I’ve designed a neural-based PUL system that seems to work very well . . . sort of. The problem is that the system I designed is very complex because it has dozens of hyperparameters. Examples include neural architecture (number of layers, activations, etc.), neural training (batch size, learning rate, etc.), and many PUL-specific design choices.

Based on my years of experience, my PUL system, as it stands now, is interesting and possibly useful from a research / theoretical perspective, but the system is less useful from a practical perspective. I’ve worked in several software production environments, and in many situations system simplicity is more important than a small increase in performance. Put somewhat differently, this PUL system might be useful for one-off data analysis and experimentation but not as useful as a black box system.

My demo data looks like this:

```
# patients_positive.txt
1    0.24   1   0   0   0.2950   0   0   1   1
1    0.45   0   1   0   0.5410   0   1   0   1
1    0.55   0   0   1   0.6460   1   0   0   1
. . .

# patients_unlabeled.txt
-9   0.39   0   0   1   0.5120   0   1   0   0
-9   0.36   1   0   0   0.4450   0   1   0   0
-9   0.50   0   1   0   0.5650   0   1   0   1
. . .
```

The data is synthetic. The first column holds a label indicating whether the patient has a disease, where -9 indicates unlabeled and 1 indicates positive. The next columns are predictor variables. The last column holds the true class label, 0 or 1, so I can evaluate the accuracy of the PUL system. There are 20 positive data items and 180 unlabeled items.

The output of the system is a pair of probabilities for each unlabeled data item, for example [0.123, 0.877], where the first value is the probability of class 0 and the second value is the probability of class 1. The system uses a delta threshold: only those items where the difference between prob(0) and prob(1) is greater than the delta are used to make predictions. For example, if the threshold is 0.50 then a result like [0.20, 0.80] is used (prediction is class 1) but a result like [0.45, 0.55] isn’t used because the probabilities are too close together.
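The delta-threshold filter can be sketched in a few lines. This is my own minimal reconstruction of the idea, not the actual system code; `filter_by_delta` is a hypothetical helper name.

```python
def filter_by_delta(probs, delta):
    # probs: list of (p0, p1) pairs, one per unlabeled item.
    # Keep only confident items, i.e. those where |p0 - p1| > delta,
    # and return (item index, predicted class) for each.
    results = []
    for i, (p0, p1) in enumerate(probs):
        if abs(p0 - p1) > delta:
            results.append((i, 0 if p0 > p1 else 1))
    return results

print(filter_by_delta([(0.20, 0.80), (0.45, 0.55)], 0.50))  # [(0, 1)]
```

With delta = 0.50, the [0.20, 0.80] item passes (predicted class 1) while the [0.45, 0.55] item is skipped, matching the example above.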

My neural system achieves 85% accuracy using a threshold = 0.50, but with such a large threshold only 34 of the 180 unlabeled data items are predicted (29 correct, 5 wrong). A smaller threshold makes more predictions but with lower accuracy.
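The coverage-versus-accuracy tradeoff can be measured with a small helper like this sketch. The probability pairs and true labels below are made-up toy values to illustrate the effect, not the demo results, and `evaluate` is a hypothetical function name.

```python
def evaluate(probs, true_labels, delta):
    # Count predictions, and correct predictions, among items where
    # the two probabilities differ by more than delta.
    num_pred, num_correct = 0, 0
    for (p0, p1), truth in zip(probs, true_labels):
        if abs(p0 - p1) > delta:
            num_pred += 1
            pred = 0 if p0 > p1 else 1
            if pred == truth:
                num_correct += 1
    acc = num_correct / num_pred if num_pred > 0 else 0.0
    return num_pred, acc

# Toy data: confident items are right, borderline items are wrong.
probs = [(0.10, 0.90), (0.45, 0.55), (0.85, 0.15), (0.52, 0.48)]
truth = [1, 0, 0, 1]
print(evaluate(probs, truth, 0.50))  # (2, 1.0) -- few predictions, all correct
print(evaluate(probs, truth, 0.00))  # (4, 0.5) -- more predictions, lower accuracy
```

On this toy data, raising delta from 0.00 to 0.50 cuts coverage in half but removes the two incorrect borderline predictions, which is exactly the tradeoff described above.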

One of my colleagues, Alexandra S., pointed out that in PUL systems it’s often important to have a human in the loop. In other words, for the synthetic patient data of my demo, those unlabeled items that are marked as class 1 should not be automatically assumed to be class 1 with absolute certainty — the items should be thought of as possibly class 1 and then examined closely by a human.

Unlabeled data hides its true identity. Masks do the same for people. The Venice Carnival runs roughly the two weeks before Lent — the 40 days preceding Easter — and has featured beautiful masks and costumes since the 12th century. For these masks, more complexity is more appealing (to me anyway).

This entry was posted in Machine Learning.