I wrote an article titled “Computing the Similarity Between Two Machine Learning Datasets” in the September 2021 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/09/20/dataset-similarity.aspx.

A common task in many machine learning scenarios is the need to compute the similarity (or difference or distance) between two datasets. For example, if you select a sample from a huge set of training data, you likely want to know how similar the sample dataset is to the source dataset. Or if you want to prime the training for a very deep neural network, you need to find an existing model that was trained using a dataset that is most similar to your new dataset.

At first thought, computing the similarity/distance between two datasets sounds easy, but in fact the problem is extremely difficult. If you try to compare individual lines between datasets, you quickly run into the combinatorial explosion problem — there are just too many comparisons. There are also the related problems of dealing with different dataset sizes and handling non-numeric data.

My article explains how to compute the distance between any two datasets, P and Q, using a combination of neural and classical statistics techniques. Briefly, you use the P dataset as a reference and construct a neural autoencoder. You apply the autoencoder to the P and Q datasets to convert each data item to a condensed representation. Next, you construct frequency distributions for the P and Q condensed representations. Finally, you compute the similarity (or distance, or difference) between the two frequency distributions. This value measures the similarity between the two datasets.
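The steps above can be sketched in code. This is a minimal illustration only, not the article's implementation: a tiny linear autoencoder trained with plain gradient descent stands in for the deep neural autoencoder, and a symmetric Kullback-Leibler value stands in for whatever distribution distance you prefer. Names such as `latent_dim` and `n_bins` are illustrative hyperparameters I made up for the sketch.

```python
# Sketch of the dataset-distance pipeline: encode with an autoencoder
# trained on the reference dataset P, bin the condensed representations
# into frequency distributions, then compare the distributions.
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(P, latent_dim=1, lr=0.01, epochs=200):
    # Linear stand-in for a neural autoencoder: minimize ||P W_e W_d - P||^2.
    n_features = P.shape[1]
    W_e = rng.normal(0, 0.1, (n_features, latent_dim))  # encoder weights
    W_d = rng.normal(0, 0.1, (latent_dim, n_features))  # decoder weights
    for _ in range(epochs):
        Z = P @ W_e                      # condensed representations
        G = (Z @ W_d) - P                # gradient of 0.5*||recon - P||^2
        W_d -= lr * Z.T @ G / len(P)
        W_e -= lr * P.T @ (G @ W_d.T) / len(P)
    return W_e

def freq_distribution(z, edges):
    counts, _ = np.histogram(z, bins=edges)
    # Add-one smoothing so neither distribution has a zero bin (avoids log(0)).
    return (counts + 1) / (counts.sum() + len(counts))

def dataset_distance(P, Q, latent_dim=1, n_bins=10):
    W_e = train_autoencoder(P, latent_dim)   # P is the reference dataset
    zP = (P @ W_e).ravel()                   # 1-D latent for easy binning
    zQ = (Q @ W_e).ravel()
    lo = min(zP.min(), zQ.min())
    hi = max(zP.max(), zQ.max())
    edges = np.linspace(lo, hi, n_bins + 1)  # same bins for both datasets
    p = freq_distribution(zP, edges)
    q = freq_distribution(zQ, edges)
    # Symmetric Kullback-Leibler divergence between the two distributions.
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

P = rng.normal(0.0, 1.0, (200, 4))
Q_near = rng.normal(0.0, 1.0, (150, 4))  # same distribution, different size
Q_far = rng.normal(3.0, 1.0, (150, 4))   # shifted distribution
print(dataset_distance(P, Q_near), dataset_distance(P, Q_far))
```

Note that the datasets have different sizes (200 vs. 150 rows), yet the two frequency distributions always have `n_bins` entries each, which is what makes the final comparison straightforward.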

The technique is very clever, if I do say so myself.

The technique solves the non-numeric data issue (you can encode categorical data for an autoencoder), the different dataset sizes problem (the resulting frequency distributions have the same size), and the combinatorial explosion problem (the technique doesn't directly compare items in the two different datasets).
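To make the non-numeric data point concrete, here is one common way to encode a categorical column so it can be fed to an autoencoder: one-hot encoding. This is a generic illustration, not the specific encoding scheme used in the article.

```python
# One-hot encode a categorical column so an autoencoder can consume it.
import numpy as np

def one_hot(values, categories):
    # Map each categorical value to a 0/1 vector with a single 1.
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0
    return out

colors = ["red", "blue", "red", "green"]
encoded = one_hot(colors, categories=["red", "green", "blue"])
print(encoded)
# Each row is numeric, so it can be concatenated with the other
# (normalized) numeric columns of the dataset.
```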

The major disadvantage of the technique is that it has a lot of hyperparameters: neural architecture (number of hidden layers, number of nodes in each layer, including the size of the internal latent dimension), neural training (batch size, optimization algorithm, maximum epochs), and frequency-related (the number of bins to place data items into).
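A quick illustration of why the frequency-related hyperparameter matters: the same set of condensed representations produces noticeably different frequency distributions depending on the number of bins, which in turn changes the computed distance. The values here are synthetic, just to show the effect.

```python
# The number of histogram bins is a genuine hyperparameter: it changes
# the frequency distribution that the dataset distance is computed from.
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(0.0, 1.0, 500)  # pretend these are condensed representations

for n_bins in (4, 10, 25):
    counts, _ = np.histogram(z, bins=n_bins, range=(-4, 4))
    freqs = counts / counts.sum()
    # Coarser binning concentrates probability mass into fewer, larger bins.
    print(n_bins, round(float(freqs.max()), 3))
```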

*There are techniques that quantify the similarity / difference between images, but those techniques aren’t as good as the human eye in many cases. Here are three illustrations that I’d say have some similarity to each other, even though quantitatively the similarity is near zero. Left: By Wilton Williams. Center: By Ernesto Cabral. Right: By George Barbier.*
