Some Thoughts About Dealing With Imbalanced Training Data

Suppose you have a binary classification problem where there are many of one class, but very few of the other class. For example, with medical data, you might have many thousands of data items representing people who are class 0 (no disease), but only a few dozen items that are class 1 (have the disease). Or with security data, you could have thousands of data items that are class 0 (normal) but only a few items that are class 1 (security risk). Such datasets are called imbalanced.

If you train a prediction model using all of the imbalanced training data, the items with the majority class will overwhelm the items with the minority class. The resulting model will likely predict the majority class for any input.

The same ideas apply to multi-class classification. For simplicity, I’ll assume a binary classification scenario.

There are two approaches for dealing with imbalanced training data. You can delete some of the majority items so you have roughly equal numbers of both classes. Or you can generate new synthetic minority items. In practice, best results are usually obtained by combining techniques: delete some majority class items and also generate some synthetic minority class items.

Both of these general approaches have dozens of variations. The fact that there are dozens of techniques indicates that no single technique works best.

A typical example of generating synthetic data is the SMOTE technique (“synthetic minority over-sampling technique”). It’s very simple. You repeatedly select a minority class data item, A, at random. Then you find the five nearest neighbor items of A, call them (B, C, D, E, F). Then you randomly select one of the nearest neighbors, suppose it’s E. Then you construct a new synthetic minority item by interpolating between A and E: pick a random fraction t between 0 and 1 and compute A + t * (E - A), which is a point somewhere on the line segment between A and E.
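In code, the heart of SMOTE looks something like the sketch below. The function name smote_sample and its parameters are mine, just for illustration; scikit-learn's NearestNeighbors does the neighbor search.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(minority, n_synthetic, k=5, seed=0):
  # minority: (n, d) NumPy array of minority class items (numeric only)
  rng = np.random.default_rng(seed)
  nbrs = NearestNeighbors(n_neighbors=k+1).fit(minority)
  _, idx = nbrs.kneighbors(minority)  # idx[i][0] is item i itself

  synthetic = np.zeros((n_synthetic, minority.shape[1]))
  for s in range(n_synthetic):
    i = rng.integers(len(minority))        # pick minority item A at random
    j = idx[i][rng.integers(1, k+1)]       # pick one of its k nearest neighbors, E
    t = rng.random()                       # random fraction in [0, 1)
    synthetic[s] = minority[i] + t * (minority[j] - minority[i])
  return synthetic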

There are many variations of the SMOTE technique. But they all assume that your data items are strictly numeric — you can’t directly find k-nearest neighbors on categorical data, and you can’t find an average of two categorical data items.

A typical example of deleting majority class items is called “down-sample and up-weight”. In its most basic form, you delete a randomly selected 50% of the majority class items, a down-sampling factor of 2. Then, during training, when you compute the loss value for a majority class item, you multiply the loss by 2. The idea is that there are really twice as many majority items as you’re training on, so each majority class item should count twice as much. This approach seems odd at first, but because you have roughly equal numbers of majority and minority class items during training, the majority class items are less likely to overwhelm the minority class items.
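A bare-bones sketch of the idea, assuming class 0 is the majority class and class 1 is the minority class (the function names are just for illustration):

import numpy as np
import torch

def down_sample(X, y, factor=2, seed=0):
  # keep all minority (class 1) items; keep 1/factor of the majority (class 0) items
  rng = np.random.default_rng(seed)
  maj_idx = np.where(y == 0)[0]
  min_idx = np.where(y == 1)[0]
  keep_maj = rng.choice(maj_idx, size=len(maj_idx) // factor, replace=False)
  keep = np.concatenate([keep_maj, min_idx])
  # each retained majority item stands in for 'factor' original majority items
  wts = np.where(y[keep] == 0, float(factor), 1.0)
  return X[keep], y[keep], wts

def weighted_bce_loss(logits, targets, wts):
  # wts: per-item weights as a torch tensor, e.g. torch.tensor(wts_np, dtype=torch.float32)
  per_item = torch.nn.functional.binary_cross_entropy_with_logits(
    logits, targets, reduction='none')
  return (per_item * wts).mean()  # up-weight: scale each item's loss before averaging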

There are many variations of the down-sample and up-weight technique. An interesting paper written by one of my work colleagues starts by training a crude (but fast to create) prediction model on all data, and then uses the loss information generated by the crude model to intelligently select majority items to delete.

In general, whenever I read about a problem that uses a classical statistics or classical ML technique (such as k-nearest neighbors), I ponder using one of the deep neural techniques in my personal ML mental toolkit.

My first thought was that generating synthetic minority class items sounds like a problem that is well-suited for a variational autoencoder (VAE). A VAE is specifically designed to generate synthetic data. So, you’d train a VAE on the minority class items, then use the trained VAE to generate synthetic minority class items. Simple.

One advantage of the VAE approach compared to SMOTE is that a VAE can work with both numeric and categorical data. However, a VAE is much more complex than a k-NN based approach, possibly too complex for an average data scientist. I zapped together a proof of concept using PyTorch, where I created and trained a VAE to generate synthetic ‘1’ digits from the UCI Digits dataset. Each image is 8 by 8 grayscale pixels. The synthetic ‘1’ looked decent enough. However, when I generated several synthetic ‘1’ digits, they seemed too close to each other. This suggests my VAE is too good: it’s overfitting. One way to deal with a VAE that overfits is to adjust its architecture.
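For what it’s worth, here is the general structure of the kind of VAE I’m describing, written with PyTorch. This is just a bare-bones sketch, not my actual proof-of-concept code; the layer sizes and the two-dimensional latent space are arbitrary choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
  # minimal VAE for 8x8 = 64-pixel images with values scaled to [0, 1]
  def __init__(self, latent_dim=2):
    super().__init__()
    self.enc = nn.Linear(64, 32)
    self.mu = nn.Linear(32, latent_dim)       # mean of latent distribution
    self.logvar = nn.Linear(32, latent_dim)   # log-variance of latent distribution
    self.dec1 = nn.Linear(latent_dim, 32)
    self.dec2 = nn.Linear(32, 64)

  def encode(self, x):
    h = torch.relu(self.enc(x))
    return self.mu(h), self.logvar(h)

  def decode(self, z):
    return torch.sigmoid(self.dec2(torch.relu(self.dec1(z))))

  def forward(self, x):
    mu, logvar = self.encode(x)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)      # reparameterization trick
    return self.decode(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
  # reconstruction error plus KL divergence from the standard normal prior
  bce = F.binary_cross_entropy(recon, x, reduction='sum')
  kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
  return bce + kld

# after training on minority class items only, sample the prior and decode:
#   z = torch.randn(10, 2)        # 10 random latent vectors
#   synthetic = vae.decode(z)     # 10 synthetic 64-pixel items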

As always, thought experiments are a good start, but I’d need to code some experiments to see what will actually happen.


No face has features that are perfectly balanced. In fact, a certain amount of imbalance in facial features contributes to attractiveness. Here are four attractive celebrities who have one eye that is noticeably smaller than the other. From left to right: Paris Hilton, Ariana Grande, Ryan Gosling, Angelina Jolie.

