Dataset Similarity Using Autoencoded Jensen-Shannon Distance

A problem that pops up over and over in machine learning and data science scenarios is the need to compute the similarity (or nearly equivalently, difference or distance) between two datasets. At first thought, this doesn’t seem difficult, but the problem is extemely difficult. Briefly, if you try to compare individual lines between datasets, you hit the combinatorial explosion problem — there are just too many comparisons. Additionally, there are the problems of dealing with different dataset sizes, and dealing with non-numeric data. The bottom line is that there is no simple way to calculate the similarity/distance between two datasets.

Several months ago, I came up with an interesting technique for the distance between datasets P and Q. First, transform each data line of P and Q into an embedding vector representation using a deep neural autoencoder. For example, if you set the embedding dim to 3, a line of data like (male, 34, engineer, $72,000.00, Fullerton, 0.7717) might be transformed into a vector like [0.3256, 0.8911, 0.7936].

Next, each dataset representation determines a frequency distribution for each latent variable. For example:

P 1: [0.10, 0.00, 0.05, 0.05, 0.20, 0.10, 0.00, 0.30, 0.10, 0.10]
Q 1: [0.05, 0.10, 0.00, 0.15, 0.20, 0.10, 0.20, 0.10, 0.00, 0.10]

This means that for latent varaible 1 (of the three), in dataset P, 10% of the data items are between 0.00 and 0.10, 0% are between 0.10 and 0.20, 5% are between 0.20 and 0.30, and so on.

In my original thought, I figured to use the Kullback-Leibler divergence to compute the difference between the P and Q frequency distributions. But in my most recent thought I wondered about using Jensen-Shannon distance. So you compute the three distances between the three different P and Q distributions using JS. And last, you compute the average of the three JS distances to give the final distance between P and Q.

To test this idea, I coded it up using PyTorch. Then I created a reference dataset P that is 100 lines of the UCI Digits dataset. Each line represents a crude 8×8 handwritten digit and has 64 pixel values between 0 and 16 followed by the class label between 0 and 9. Then I made 10 different Q datasets where each is based on P but with a different percentage of lines where the values have been assigned randomly — 0%, 10%, 20% . . 100%. The idea is that if P and Q are the same, the distance should be 0, and as the percentage of randomization increases, the distance should increase.

My experiment worked very nicely and I was pleased.



There aren’t very many true identical twins actors and actresses. Among the most well known are actors James and Oliver Phelps who played Fred and George Weasley in the “Harry Potter” series. It’s much easier to get a single actor to play dual roles when a twin is needed. Here are three other pairs of true identical twins. Left: Mary and Madeline Collinson as vampires in “Twins of Evil” (1971). Center: Peter and David Paul as barbarian warriors in “The Barbarians” (1987). Right: Leigh and Lynette Harris as sisters being menaced by the evil wizard Traigon in “Sorceress” (1982).


Code below. Long.

# dataset_distance.py

# for UCI Digits

# distance between datasets P and Q
# assumes P and Q data normalized to [0.0, 1.0]
# for sampling and generation evaluation, etc.

# PyTorch 1.8.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10

import numpy as np
import time
import torch as T

device = T.device("cpu")

# -----------------------------------------------------------

class Digits_Dataset(T.utils.data.Dataset):
  # for an Autoencoder (not a classifier)
  # assumes data is:
  # 64 pixel values (0-16) (comma) label (0-9)
  # [0] [1] . . [63] [64] 

  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(65),
      delimiter=",", comments="#", dtype=np.float32)
    self.xy_data = T.tensor(all_xy, dtype=T.float32).to(device) 
    self.xy_data[:, 0:64] /= 16.0  # normalize pixels
    self.xy_data[:, 64] /= 9.0     # normalize digit labels

  def __len__(self):
    return len(self.xy_data)

  def __getitem__(self, idx):
    xy = self.xy_data[idx]
    return xy

# -----------------------------------------------------------

class Autoencoder(T.nn.Module):  # 65-32-4-32-65
  def __init__(self):
    super(Autoencoder, self).__init__()  
    self.layer1 = T.nn.Linear(65, 32)  # includes labels
    self.layer2 = T.nn.Linear(32, 4)
    self.layer3 = T.nn.Linear(4, 32)
    self.layer4 = T.nn.Linear(32, 65)

    self.latent_dim = 4

  def encode(self, x):             # [bs,65]
    z = T.tanh(self.layer1(x))     # [bs,32]
    z = T.sigmoid(self.layer2(z))  # [bs,4]
    return z 

  def decode(self, x):              # [bs,4]
    z = T.tanh(self.layer3(x))      # [bs,32]
    z = T.sigmoid(self.layer4(z))   # [bs,785]
    return z 

  def forward(self, x):
    z = self.encode(x)
    oupt = self.decode(z)
    return oupt

# -----------------------------------------------------------

def train(ae, ds, bs, me, le):
  # train autoencoder ae with dataset ds using batch size bs, 
  # with max epochs me, log_every le
  data_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)
  
  loss_func = T.nn.MSELoss() 
  opt = T.optim.Adam(ae.parameters(), lr=0.001)
  print("Starting training")
  for epoch in range(0, me):
    for (b_idx, batch) in enumerate(data_ldr):
      opt.zero_grad()
      X = batch
      oupt = ae(X)
      loss_val = loss_func(oupt, X)  # note X not Y
      loss_val.backward()
      opt.step()

    if epoch "gt" 0 and epoch % le == 0:
      print("epoch = %6d" % epoch, end="")
      print("  curr batch loss = %7.4f" % loss_val.item(), end="")
      print("")
  print("Training complete ")

# -----------------------------------------------------------

def show_uci_file(fn, n_lines):
  fin = open(fn, "r")
  i = 0
  for line in fin:
    line = line.strip()
    tokens = line.split(",")
    for j in range(0,3):                 # first three pixels
      print("%4s" % tokens[j], end="")
    print(" . . . ", end="")
    for j in range(30,32):               # two middle pixels
      print("%4s" % tokens[j], end="")
    print(" . . . ", end="")
    for j in range(61,64):               # last three pixels  
      print("%4s" % tokens[j], end="")
    print("   ** ", end="")
    print("%3s" % tokens[64])           # label / digit

    i += 1
    if i == n_lines: break
  fin.close()

# -----------------------------------------------------------

# def make_random_subset_file(n_lines, src_fn, dest_fn):
#   # count lines in src
#   fin = open(src_fn, "r")
#   ct = 0
#   for line in fin:
#     ct += 1
#   fin.close()

#   # make array of rows to select
#   all_rows = np.arange(ct)
#   np.random.shuffle(all_rows)
#   # get some rows
#   selected_rows = all_rows[0:n_lines]
#   # write selected rows to dest file
#   fin = open(src_fn, "r")
#   fout = open(dest_fn, "w")
#   i = 0
#   for line in fin:
#     if i in selected_rows:
#       fout.write(line)  # includes new_line
#     i += 1
#   fout.close()
#   fin.close()

# -----------------------------------------------------------

def value_to_bin(x):
  # x is in [0.0, 1.0]
  if x "gte" 0.0 and x "lt" 0.1:   return 0
  elif x "gte" 0.1 and x "lt" 0.2: return 1
  elif x "gte" 0.2 and x "lt" 0.3: return 2
  elif x "gte" 0.3 and x "lt" 0.4: return 3
  elif x "gte" 0.4 and x "lt" 0.5: return 4
  elif x "gte" 0.5 and x "lt" 0.6: return 5
  elif x "gte" 0.6 and x "lt" 0.7: return 6
  elif x "gte" 0.7 and x "lt" 0.8: return 7
  elif x "gte" 0.8 and x "lt" 0.9: return 8
  else:                            return 9 

# -----------------------------------------------------------

def make_freq_mat(ae, ds):
  d = ae.latent_dim
  result = np.zeros((d,10), dtype=np.int64)  # [lat_dim,10]
  n = len(ds)
  for i in range(n):
    x = ds[i]
    with T.no_grad():
      latents = ae.encode(x)  # like T([0.74, 0.45, 0.56, 0.54])
    latents = latents.numpy() # convert from tensor
    for j in range(d):
      bin = value_to_bin(latents[j])
      result[j][bin] += 1
  result = (result * 1.0) / n
  return result 

# -----------------------------------------------------------

def KL(p_freq, q_freq):
  # Kullback-Leibler "from q to p"
  # p and q are np array prob distributions
  n = len(p_freq)
  sum = 0.0
  for i in range(n):
    sum += p_freq[i] * np.log(p_freq[i] / q_freq[i])
  return sum

def js_distance(p_freq, q_freq):
  m = 0.5 * (p_freq + q_freq)  # avg of P and Q
  left = KL(p_freq, m)
  right = KL(q_freq, m)
  return np.sqrt((left + right) / 2)

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin UCI Digits neural dataset distance demo ")
  T.manual_seed(1)
  np.random.seed(1)
  p_file = ".\\Data\\uci_digits_train_100.txt"  # ref dataset

  # q_file = ".\\Data\\uci_digits_random_80.txt"  # subset 
  # make_random_subset_file(80, p_file, q_file)

  # q_file = ".\\Data\\uci_digits_train_100.txt"  # dist = 0
  # q_file = ".\\Data\\digits_100_noise_10.txt"   # 0.0660
  # q_file = ".\\Data\\digits_100_noise_20.txt"   # 0.1131
  # q_file = ".\\Data\\digits_100_noise_30.txt"   # 0.1689
  # q_file = ".\\Data\\digits_100_noise_40.txt"   # 0.2163
  q_file = ".\\Data\\digits_100_noise_50.txt"   # 0.2563
  # q_file = ".\\Data\\digits_100_noise_60.txt"   # 0.2893
  # q_file = ".\\Data\\digits_100_noise_70.txt"   # 0.3469 
  # q_file = ".\\Data\\digits_100_noise_80.txt"   # 0.4053
  # q_file = ".\\Data\\digits_100_noise_90.txt"   # 0.4370
  # q_file = ".\\Data\\digits_100_noise_100.txt"  # 0.5401

  print("\nP file = " + str(p_file))
  print("Q file = " + str(q_file))
  show_uci_file(q_file, 5)
  
  # 1. create Dataset objects
  print("\nCreating P and Q Datasets ")
  p_ds = Digits_Dataset(p_file)
  q_ds = Digits_Dataset(q_file)

  # 2. create and train autoencoder model using parent 
  print("\nCreating autoencoder using P \n")
  autoenc = Autoencoder()   # 65-32-4-32-65
  autoenc.train()           # set mode

  bat_size = 10
  max_epochs = 100
  log_every = 20
  train(autoenc, p_ds, bat_size, max_epochs,
    log_every)

  # 3. TODO: save trained autoencoder

  # 4. create frequency matrices for reference P and other Q
  print("\nCreating frequency arrays for P, Q ")
  p_freq = make_freq_mat(autoenc, p_ds)
  q_freq = make_freq_mat(autoenc, q_ds)
  p_freq += 1.0e-5  # avoid divide by 0
  q_freq += 1.0e-5
  np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
  print(p_freq)
  print("")
  print(q_freq)

  # 5. compute an f-divergence: KL, chi-square, etc.
  print("\nComputing JS distances for latent dim values")
  sum = 0.0
  for j in range(4):  # one JS for each latent variable
    js = js_distance(p_freq[j], q_freq[j])
    print("%0.5f" % js)
    sum += js
  dist = sum / 4
  print("\nDistance(P,Q) = %0.4f " % dist)

  print("\nEnd UCI Digits neural dataset distance demo")

# -----------------------------------------------------------

if __name__ == "__main__":
  main()
This entry was posted in PyTorch. Bookmark the permalink.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s