A problem that pops up over and over in machine learning and data science scenarios is the need to compute the similarity (or nearly equivalently, difference or distance) between two datasets. At first thought, this doesn’t seem difficult, but the problem is extemely difficult. Briefly, if you try to compare individual lines between datasets, you hit the combinatorial explosion problem — there are just too many comparisons. Additionally, there are the problems of dealing with different dataset sizes, and dealing with non-numeric data. The bottom line is that there is no simple way to calculate the similarity/distance between two datasets.

Several months ago, I came up with an interesting technique for the distance between datasets P and Q. First, transform each data line of P and Q into an embedding vector representation using a deep neural autoencoder. For example, if you set the embedding dim to 3, a line of data like (male, 34, engineer, $72,000.00, Fullerton, 0.7717) might be transformed into a vector like [0.3256, 0.8911, 0.7936].

Next, each dataset representation determines a frequency distribution for each latent variable. For example:

P 1: [0.10, 0.00, 0.05, 0.05, 0.20, 0.10, 0.00, 0.30, 0.10, 0.10] Q 1: [0.05, 0.10, 0.00, 0.15, 0.20, 0.10, 0.20, 0.10, 0.00, 0.10]

This means that for latent varaible 1 (of the three), in dataset P, 10% of the data items are between 0.00 and 0.10, 0% are between 0.10 and 0.20, 5% are between 0.20 and 0.30, and so on.

In my original thought, I figured to use the Kullback-Leibler divergence to compute the difference between the P and Q frequency distributions. But in my most recent thought I wondered about using Jensen-Shannon distance. So you compute the three distances between the three different P and Q distributions using JS. And last, you compute the average of the three JS distances to give the final distance between P and Q.

To test this idea, I coded it up using PyTorch. Then I created a reference dataset P that is 100 lines of the UCI Digits dataset. Each line represents a crude 8×8 handwritten digit and has 64 pixel values between 0 and 16 followed by the class label between 0 and 9. Then I made 10 different Q datasets where each is based on P but with a different percentage of lines where the values have been assigned randomly — 0%, 10%, 20% . . 100%. The idea is that if P and Q are the same, the distance should be 0, and as the percentage of randomization increases, the distance should increase.

My experiment worked very nicely and I was pleased.

*There aren’t very many true identical twins actors and actresses. Among the most well known are actors James and Oliver Phelps who played Fred and George Weasley in the “Harry Potter” series. It’s much easier to get a single actor to play dual roles when a twin is needed. Here are three other pairs of true identical twins. Left: Mary and Madeline Collinson as vampires in “Twins of Evil” (1971). Center: Peter and David Paul as barbarian warriors in “The Barbarians” (1987). Right: Leigh and Lynette Harris as sisters being menaced by the evil wizard Traigon in “Sorceress” (1982).*

Code below. Long.

# dataset_distance.py # for UCI Digits # distance between datasets P and Q # assumes P and Q data normalized to [0.0, 1.0] # for sampling and generation evaluation, etc. # PyTorch 1.8.0-CPU Anaconda3-2020.02 Python 3.7.6 # Windows 10 import numpy as np import time import torch as T device = T.device("cpu") # ----------------------------------------------------------- class Digits_Dataset(T.utils.data.Dataset): # for an Autoencoder (not a classifier) # assumes data is: # 64 pixel values (0-16) (comma) label (0-9) # [0] [1] . . [63] [64] def __init__(self, src_file): all_xy = np.loadtxt(src_file, usecols=range(65), delimiter=",", comments="#", dtype=np.float32) self.xy_data = T.tensor(all_xy, dtype=T.float32).to(device) self.xy_data[:, 0:64] /= 16.0 # normalize pixels self.xy_data[:, 64] /= 9.0 # normalize digit labels def __len__(self): return len(self.xy_data) def __getitem__(self, idx): xy = self.xy_data[idx] return xy # ----------------------------------------------------------- class Autoencoder(T.nn.Module): # 65-32-4-32-65 def __init__(self): super(Autoencoder, self).__init__() self.layer1 = T.nn.Linear(65, 32) # includes labels self.layer2 = T.nn.Linear(32, 4) self.layer3 = T.nn.Linear(4, 32) self.layer4 = T.nn.Linear(32, 65) self.latent_dim = 4 def encode(self, x): # [bs,65] z = T.tanh(self.layer1(x)) # [bs,32] z = T.sigmoid(self.layer2(z)) # [bs,4] return z def decode(self, x): # [bs,4] z = T.tanh(self.layer3(x)) # [bs,32] z = T.sigmoid(self.layer4(z)) # [bs,785] return z def forward(self, x): z = self.encode(x) oupt = self.decode(z) return oupt # ----------------------------------------------------------- def train(ae, ds, bs, me, le): # train autoencoder ae with dataset ds using batch size bs, # with max epochs me, log_every le data_ldr = T.utils.data.DataLoader(ds, batch_size=bs, shuffle=True) loss_func = T.nn.MSELoss() opt = T.optim.Adam(ae.parameters(), lr=0.001) print("Starting training") for epoch in range(0, me): for (b_idx, batch) in enumerate(data_ldr): opt.zero_grad() X = batch oupt = ae(X) loss_val = loss_func(oupt, X) # note X not Y loss_val.backward() opt.step() if epoch "gt" 0 and epoch % le == 0: print("epoch = %6d" % epoch, end="") print(" curr batch loss = %7.4f" % loss_val.item(), end="") print("") print("Training complete ") # ----------------------------------------------------------- def show_uci_file(fn, n_lines): fin = open(fn, "r") i = 0 for line in fin: line = line.strip() tokens = line.split(",") for j in range(0,3): # first three pixels print("%4s" % tokens[j], end="") print(" . . . ", end="") for j in range(30,32): # two middle pixels print("%4s" % tokens[j], end="") print(" . . . ", end="") for j in range(61,64): # last three pixels print("%4s" % tokens[j], end="") print(" ** ", end="") print("%3s" % tokens[64]) # label / digit i += 1 if i == n_lines: break fin.close() # ----------------------------------------------------------- # def make_random_subset_file(n_lines, src_fn, dest_fn): # # count lines in src # fin = open(src_fn, "r") # ct = 0 # for line in fin: # ct += 1 # fin.close() # # make array of rows to select # all_rows = np.arange(ct) # np.random.shuffle(all_rows) # # get some rows # selected_rows = all_rows[0:n_lines] # # write selected rows to dest file # fin = open(src_fn, "r") # fout = open(dest_fn, "w") # i = 0 # for line in fin: # if i in selected_rows: # fout.write(line) # includes new_line # i += 1 # fout.close() # fin.close() # ----------------------------------------------------------- def value_to_bin(x): # x is in [0.0, 1.0] if x "gte" 0.0 and x "lt" 0.1: return 0 elif x "gte" 0.1 and x "lt" 0.2: return 1 elif x "gte" 0.2 and x "lt" 0.3: return 2 elif x "gte" 0.3 and x "lt" 0.4: return 3 elif x "gte" 0.4 and x "lt" 0.5: return 4 elif x "gte" 0.5 and x "lt" 0.6: return 5 elif x "gte" 0.6 and x "lt" 0.7: return 6 elif x "gte" 0.7 and x "lt" 0.8: return 7 elif x "gte" 0.8 and x "lt" 0.9: return 8 else: return 9 # ----------------------------------------------------------- def make_freq_mat(ae, ds): d = ae.latent_dim result = np.zeros((d,10), dtype=np.int64) # [lat_dim,10] n = len(ds) for i in range(n): x = ds[i] with T.no_grad(): latents = ae.encode(x) # like T([0.74, 0.45, 0.56, 0.54]) latents = latents.numpy() # convert from tensor for j in range(d): bin = value_to_bin(latents[j]) result[j][bin] += 1 result = (result * 1.0) / n return result # ----------------------------------------------------------- def KL(p_freq, q_freq): # Kullback-Leibler "from q to p" # p and q are np array prob distributions n = len(p_freq) sum = 0.0 for i in range(n): sum += p_freq[i] * np.log(p_freq[i] / q_freq[i]) return sum def js_distance(p_freq, q_freq): m = 0.5 * (p_freq + q_freq) # avg of P and Q left = KL(p_freq, m) right = KL(q_freq, m) return np.sqrt((left + right) / 2) # ----------------------------------------------------------- def main(): # 0. get started print("\nBegin UCI Digits neural dataset distance demo ") T.manual_seed(1) np.random.seed(1) p_file = ".\\Data\\uci_digits_train_100.txt" # ref dataset # q_file = ".\\Data\\uci_digits_random_80.txt" # subset # make_random_subset_file(80, p_file, q_file) # q_file = ".\\Data\\uci_digits_train_100.txt" # dist = 0 # q_file = ".\\Data\\digits_100_noise_10.txt" # 0.0660 # q_file = ".\\Data\\digits_100_noise_20.txt" # 0.1131 # q_file = ".\\Data\\digits_100_noise_30.txt" # 0.1689 # q_file = ".\\Data\\digits_100_noise_40.txt" # 0.2163 q_file = ".\\Data\\digits_100_noise_50.txt" # 0.2563 # q_file = ".\\Data\\digits_100_noise_60.txt" # 0.2893 # q_file = ".\\Data\\digits_100_noise_70.txt" # 0.3469 # q_file = ".\\Data\\digits_100_noise_80.txt" # 0.4053 # q_file = ".\\Data\\digits_100_noise_90.txt" # 0.4370 # q_file = ".\\Data\\digits_100_noise_100.txt" # 0.5401 print("\nP file = " + str(p_file)) print("Q file = " + str(q_file)) show_uci_file(q_file, 5) # 1. create Dataset objects print("\nCreating P and Q Datasets ") p_ds = Digits_Dataset(p_file) q_ds = Digits_Dataset(q_file) # 2. create and train autoencoder model using parent print("\nCreating autoencoder using P \n") autoenc = Autoencoder() # 65-32-4-32-65 autoenc.train() # set mode bat_size = 10 max_epochs = 100 log_every = 20 train(autoenc, p_ds, bat_size, max_epochs, log_every) # 3. TODO: save trained autoencoder # 4. create frequency matrices for reference P and other Q print("\nCreating frequency arrays for P, Q ") p_freq = make_freq_mat(autoenc, p_ds) q_freq = make_freq_mat(autoenc, q_ds) p_freq += 1.0e-5 # avoid divide by 0 q_freq += 1.0e-5 np.set_printoptions(formatter={'float': '{: 0.3f}'.format}) print(p_freq) print("") print(q_freq) # 5. compute an f-divergence: KL, chi-square, etc. print("\nComputing JS distances for latent dim values") sum = 0.0 for j in range(4): # one JS for each latent variable js = js_distance(p_freq[j], q_freq[j]) print("%0.5f" % js) sum += js dist = sum / 4 print("\nDistance(P,Q) = %0.4f " % dist) print("\nEnd UCI Digits neural dataset distance demo") # ----------------------------------------------------------- if __name__ == "__main__": main()

You must be logged in to post a comment.