Suppose you have a machine learning dataset for training, where only a few data items have a positive label (class = 1), but all the other data items are unlabeled and could be either negative (class = 0) or positive. This is called a positive and unlabeled learning (PUL) problem. PUL problems often appear in medical scenarios (only a few patients are diagnosed as class 1, all others are unknown) and in security scenarios.

To make sense of PUL data and use it to train a prediction model, you must somehow use the information contained in the PUL data to make intelligent guesses about the labels for the unlabeled items. This is called “finding reliable negatives”.

This is a very difficult problem. I’ve experimented with dozens of schemes for identifying reliable negatives in PUL data. The bottom line is that all techniques have many hyperparameters and results can vary wildly.

For my experiments, I set up a synthetic dataset of 200 employee data items. The data looks like:

-2  0.39  0 0 1  0.5120  0 1 0
 1  0.24  1 0 0  0.2950  0 0 1
-2  0.36  1 0 0  0.4450  0 1 0
-2  0.50  0 1 0  0.5650  0 1 0
-2  0.19  0 0 1  0.3270  1 0 0
. . .

The first column is introvert or extrovert, encoded as 1 = positive = extrovert (20 items), and -2 = unlabeled (180 items). The goal of PUL is to intelligently guess 0 = negative, or 1 = positive, for as many of the unlabeled data items as possible.
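For example, the split between known positives and unlabeled items can be counted directly from the label column. A minimal sketch (the `labels` array here is a made-up miniature, not the real 200-item file):

```python
import numpy as np

# Tiny stand-in for the label column (first column of the data file):
# 1 = known positive, -2 = unlabeled. The real file has 20 and 180.
labels = np.array([1, -2, -2, 1, -2, -2, -2, 1], dtype=np.int64)

n_pos = int(np.sum(labels == 1))    # known positive items
n_unl = int(np.sum(labels == -2))   # unlabeled items
print(n_pos, n_unl)  # 3 5
```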

The other columns in the dataset are employee age (normalized by dividing by 100), city (one of three, one-hot encoded), annual income (normalized by dividing by $100,000), and job-type (one of three, one-hot encoded).
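As a concrete illustration, here's a hedged sketch of how one raw employee record might be encoded into the eight predictor values described above (the helper name is my own invention, not part of the demo code):

```python
import numpy as np

def encode_employee(age, city, income, job):
  # Hypothetical helper: encode one raw employee record into the
  # 8-value predictor format of the demo data file.
  # age normalized by dividing by 100; income by $100,000;
  # city and job-type each one-hot encoded over three categories.
  city_oh = [0.0, 0.0, 0.0]; city_oh[city] = 1.0
  job_oh  = [0.0, 0.0, 0.0]; job_oh[job]  = 1.0
  return np.array([age / 100.0] + city_oh +
                  [income / 100_000.0] + job_oh, dtype=np.float32)

x = encode_employee(age=39, city=2, income=51_200.0, job=1)
print(x)  # [0.39, 0, 0, 1, 0.512, 0, 1, 0]
```

This reproduces the predictors of the first data item shown above (0.39, city 0 0 1, 0.5120, job 0 1 0).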

The dataset was artificially constructed so that even-numbered items [0], [2], [4], etc. are actually class 0 = negative, and odd-numbered items [1], [3], [5], etc. are actually class 1 = positive. This allows the PUL system to measure its accuracy. In a non-demo PUL scenario, you won't know the true class labels.
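Because of that even/odd construction, checking the accuracy of any set of guessed labels is easy. A minimal sketch (the function name and the example guesses are hypothetical):

```python
# Even-indexed items are truly class 0, odd-indexed items truly class 1,
# so a guessed label is correct exactly when it equals (index % 2).
def pul_accuracy(guesses):
  # guesses: dict mapping item index -> guessed label (0 or 1)
  correct = sum(1 for idx, lbl in guesses.items() if lbl == idx % 2)
  return correct / len(guesses)

guesses = {0: 0, 1: 1, 2: 1, 3: 1}   # item [2] guessed wrong
print(pul_accuracy(guesses))  # 0.75
```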

My latest exploration used this approach:

    create a dataset with all 20 known positive items
      and 20 items with random inputs marked as negative
    use dataset to train a binary classifier
      (where the output is a p-value between 0 and 1)
    scan dataset to find the min p-score and the max p-score
      for the 20 positive items
    loop each item of the PUL data
      feed item to binary classifier and compute the p-score
      if label = 1 then it's a known positive, continue
      else-if p-score < min_p_score * 0.9
        mark this item as a reliable negative, class 0
      else-if p-score > max_p_score * 0.9
        mark this item as a reliable positive, class 1
      else
        not enough evidence, so leave as unlabeled
      end-if
    end-loop
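The per-item decision rule in that approach can be sketched as a small function (the function name, parameter names, and example scores are my own, not from the demo code):

```python
def mark_item(label, p_score, min_p, max_p, slack=0.9):
  # Decision rule: compare an item's p-score against the min and max
  # p-scores of the known positives, loosened by a slack factor.
  if label == 1:
    return "known positive"         # leave known positives alone
  if p_score < min_p * slack:
    return "reliable negative"      # mark as class 0
  if p_score > max_p * slack:
    return "reliable positive"      # mark as class 1
  return "unlabeled"                # not enough evidence

print(mark_item(-2, 0.05, min_p=0.20, max_p=0.90))  # reliable negative
print(mark_item(-2, 0.95, min_p=0.20, max_p=0.90))  # reliable positive
print(mark_item(-2, 0.50, min_p=0.20, max_p=0.90))  # unlabeled
```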

Once you have examined the PUL data and identified reliable negatives (and new reliable positives), you can either 1.) repeat the process with the updated dataset, or 2.) toss out the unlabeled items and then use the dataset to train a prediction model.
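For option 2.), tossing out the still-unlabeled items is a one-liner with NumPy. A sketch, assuming the demo file layout (label in column [0], predictors in columns [1] through [8]; the three rows here are made up):

```python
import numpy as np

# Miniature dataset after the reliable-negative pass: items still
# marked -2 remain unlabeled and are dropped before final training.
data = np.array([
  [ 1, 0.39, 0, 0, 1, 0.5120, 0, 1, 0],   # positive
  [-2, 0.24, 1, 0, 0, 0.2950, 0, 0, 1],   # still unlabeled
  [ 0, 0.36, 1, 0, 0, 0.4450, 0, 1, 0],   # reliable negative
], dtype=np.float32)

labeled = data[data[:, 0] != -2]   # keep only firm class 0 / class 1 rows
print(labeled.shape)  # (2, 9)
```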

The ideas are conceptually simple, but implementation is tricky. My results were quite satisfactory, but they depend on over a dozen hyperparameters (batch_size, optimization algorithm, learning rate, NN architecture, weight initialization algorithm, and so on).

Interesting topic.

*Here are three cars made in 1970 that routinely show up in Internet searches for “ugliest cars of the 70s” and so they’d be labeled class 1 = positive (ugly). But I would assign a class label of class 0 = not ugly to all three. Left: AMC Javelin AMX (a competitor to the Ford Mustang of the time). Center: Datsun (Nissan) 510 in front of Univ. of Calif. at Irvine which was under construction at the time. I had this model of car and went to UCI when it was still under construction. Right: AMC Pacer. Weird but appealing (to me) car with a passenger side door that was 4 inches longer than the driver side door!*

Code (PyTorch) below. Long.

```python
# employee_pul_find_reliables.py
# PyTorch 1.9.0-CPU  Anaconda3-2020.02  Python 3.7.6
# Windows 10

# load all 20 known positives = 1, create 20 random-input
# items labeled as negative = 0

import numpy as np
import torch as T
device = T.device("cpu")  # apply to Tensor or Module

# ----------------------------------------------------------

class ExploreDataset(T.utils.data.Dataset):
  # label  age   city     income  job-type
  #  1     0.39  1 0 0    0.5432  1 0 0
  # -2     0.29  0 0 1    0.4985  0 1 0   (unlabeled)
  # . . .
  # [0]    [1]   [2 3 4]  [5]     [6 7 8]

  def __init__(self, fn):
    self.rnd = np.random.RandomState(1)
    tmp_x = np.zeros((40,8), dtype=np.float32)
    tmp_y = np.zeros(40, dtype=np.float32)

    # 1. load just the 20 known positives into memory
    i = 0
    f = open(fn, "r")
    for line in f:
      line = line.strip()
      if line.startswith("#"): continue
      arr = np.fromstring(line, sep="\t", dtype=np.float32)
      if int(arr[0]) == 1:       # known positive
        tmp_y[i] = arr[0]
        tmp_x[i][0:8] = arr[1:9] # the eight predictors
        i += 1
    f.close()
    tmp_y = tmp_y.reshape(-1,1)  # 2D

    # 2. create 20 synthetic items labeled as negative = 0
    for i in range(20, 40):
      # tmp_y[i] = 0               # is already 0
      tmp_x[i][0] = self.rnd.random()   # age
      city = self.rnd.randint(0,3)
      tmp_x[i][1 + city] = 1            # one-hot city
      tmp_x[i][4] = self.rnd.random()   # income
      job = self.rnd.randint(0,3)
      tmp_x[i][5 + job] = 1             # one-hot job-type

    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.float32).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx,:]  # idx rows, all 8 cols
    lbl = self.y_data[idx,:]    # idx rows, the only col
    sample = { 'predictors' : preds, 'lbl' : lbl }
    return sample

# ----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(8, 10)  # 8-(10-10)-1
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.sigmoid(self.oupt(z))  # see BCELoss() below
    return z

# ----------------------------------------------------------

def train(net, ds, bs, me, le, lr, verbose):
  # NN, dataset, batch_size, max_epochs,
  # log_every, learn_rate. optimizer and loss hard-coded.
  data_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)
  loss_func = T.nn.BCELoss()  # assumes sigmoid activation
  opt = T.optim.SGD(net.parameters(), lr=lr)

  for epoch in range(0, me):
    epoch_loss = 0.0
    for (batch_idx, batch) in enumerate(data_ldr):
      X = batch['predictors']        # inputs
      Y = batch['lbl']               # 0 or 1 targets

      opt.zero_grad()                # prepare gradients
      oupt = net(X)                  # compute output
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate for display
      loss_val.backward()            # compute gradients
      opt.step()                     # update weights

    if epoch % le == 0 and verbose:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))

# ----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin PUL two-step: find reliables ")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. create Dataset and DataLoader objects
  print("\nCreating Employee exploration Dataset ")
  pul_file = ".\\Data\\employee_pul_200.txt"
  train_ds = ExploreDataset(pul_file)

  # 2. create neural network
  print("\nCreating 8-(10-10)-1 binary NN classifier ")
  net = Net().to(device)
  net.train()  # set mode

  # 3. train
  print("\nSetting training parameters: ")
  bat_size = 4
  lrn_rate = 0.01
  max_epochs = 2000
  log_every = 500

  print("batch size = " + str(bat_size))
  print("lrn_rate = %0.2f " % lrn_rate)
  print("max_epochs = " + str(max_epochs))
  print("loss function = BCELoss() ")
  print("optimizer = SGD ")

  print("\nStarting training")
  train(net, train_ds, bat_size, max_epochs,
    log_every, lrn_rate, verbose=True)
  print("Training complete ")

  # 4. score the 20 known positives
  print("\nScoring the 20 known positives ")
  min_score = 1.0; max_score = 0.0
  net.eval()
  for i in range(20):
    x = train_ds[i]['predictors']
    with T.no_grad():
      p = net(x)
    if p.item() < min_score: min_score = p.item()
    if p.item() > max_score: max_score = p.item()
  print("Min score for known positives: %0.4f" % min_score)
  print("Max score for known positives: %0.4f" % max_score)

  # 5. scan and score the unlabeled items.
  # if p-score is less than min_score, mark item as negative
  # if p-score is greater than max_score, mark item as positive
  # because there's no training, no Dataset object is needed

  # label  age   city     income  job-type
  #  1     0.39  1 0 0    0.5432  1 0 0
  # -2     0.29  0 0 1    0.4985  0 1 0   (unlabeled)
  # . . .
  # [0]    [1]   [2 3 4]  [5]     [6 7 8]

  print("\nScanning unlabeled data ")
  pul_data = np.loadtxt(pul_file,
    usecols=[0,1,2,3,4,5,6,7,8], delimiter="\t",
    skiprows=0, comments="#", dtype=np.float32)

  for i in range(len(pul_data)):
    if i >= 4 and i <= 195: continue  # just show a few
    x = T.tensor(pul_data[i][1:9], dtype=T.float32).to(device)
    with T.no_grad():
      p = net(x)
    print("")
    print(x)
    print("score = %0.4f " % p.item())
    if int(pul_data[i][0]) == 1:
      print("existing known positive class 1 item ")
    elif p.item() < min_score * 0.90:
      print("marking this unlabeled item as reliable negative class 0 ")
    elif p.item() > max_score * 0.90:
      print("marking this unlabeled item as reliable positive class 1 ")
    else:
      print("not enough evidence to mark this item")

  print("\nEnd PUL two-step find reliables demo")

if __name__ == "__main__":
  main()
```
