Tomek Links for Pruning Imbalanced Data

Imbalanced data is machine learning training data that has many items of one class and very few items of the other class. For example, some medical data might have many thousands of data items that are “no disease” (class 0) but only a few data items that are “disease” (class 1).

Most ML prediction systems don’t work well with highly imbalanced data because the majority class items overwhelm the minority class items. So, in most cases you must prune away some of the majority class items and/or generate some new synthetic minority class items.

The idea of Tomek links is to identify majority class data items to delete. In a nutshell, a Tomek link occurs between two data items that have different classes but are each other’s nearest neighbor. The idea is best understood by a diagram:

In the diagram, look at item (0.2, 0.3), which is class 0. The nearest neighbor to (0.2, 0.3) is (0.3, 0.2), and the nearest neighbor to (0.3, 0.2) is (0.2, 0.3). Because the two items have different classes, they form a Tomek link.

On the other hand, the data item at (0.3, 0.9) has a nearest neighbor at (0.5, 0.9) but because both items are the same class, they don’t form a Tomek link.

The data in the diagram has a second Tomek link between (0.7, 0.4) and (0.8, 0.4).
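The search for Tomek links can be sketched in a few lines of plain Python. This is a minimal brute-force sketch, not a definitive implementation. The function names nearest_neighbor() and find_tomek_links() are my own, and the data is hypothetical: the coordinates loosely follow the items mentioned above, but apart from (0.2, 0.3) being class 0, the class labels and the last two items are assumptions made for illustration.

```python
import math

def nearest_neighbor(points, i):
    """Return the index of the item closest to points[i] (Euclidean distance).
    If there is a tie, the item with the lowest index wins."""
    best_j, best_d = -1, math.inf
    for j, p in enumerate(points):
        if j == i:
            continue  # an item is not its own neighbor
        d = math.dist(points[i], p)  # math.dist requires Python 3.8+
        if d < best_d:
            best_j, best_d = j, d
    return best_j

def find_tomek_links(points, labels):
    """Return index pairs (i, j), with i < j, that form Tomek links:
    the two items are each other's nearest neighbor and have different classes."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    links = []
    for i, j in enumerate(nn):
        if i < j and nn[j] == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

# Hypothetical demo data; labels are assumed, not taken from the diagram.
points = [(0.2, 0.3), (0.3, 0.2), (0.3, 0.9), (0.5, 0.9),
          (0.7, 0.4), (0.8, 0.4), (0.6, 0.7), (0.1, 0.6)]
labels = [0, 1, 0, 0, 0, 1, 0, 0]

links = find_tomek_links(points, labels)
for i, j in links:
    print(points[i], "class", labels[i], "<->", points[j], "class", labels[j])
```

With this assumed data, items (0.3, 0.9) and (0.5, 0.9) are mutual nearest neighbors but share a class, so they are not reported. The brute-force search compares every pair of items, which is fine for a demo but would be slow on a large dataset.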

Tomek links usually occur near a decision boundary (the pair in the lower left) or when one of the two items is “noise” (the pair on the right).

When you find a pair of data items that form a Tomek link, the item that has the majority class is a good candidate for deleting because it is either ambiguous (when near a decision boundary) or noise.
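Once the links are known, the pruning step is simple. Here is a minimal sketch, assuming binary labels and assuming the Tomek links are already available as pairs of indices. The function name prune_tomek_majority() and the demo data are hypothetical.

```python
from collections import Counter

def prune_tomek_majority(points, labels, links):
    """Delete the majority-class member of each Tomek link.
    links is a list of (i, j) index pairs; returns the kept points and labels."""
    majority = Counter(labels).most_common(1)[0][0]  # most frequent class label
    drop = set()
    for i, j in links:
        # The two items in a link have different classes,
        # so with binary labels exactly one of them is the majority item.
        if labels[i] == majority:
            drop.add(i)
        if labels[j] == majority:
            drop.add(j)
    kept_points = [p for k, p in enumerate(points) if k not in drop]
    kept_labels = [y for k, y in enumerate(labels) if k not in drop]
    return kept_points, kept_labels

# Hypothetical demo data; the labels and the two links are assumed.
demo_points = [(0.2, 0.3), (0.3, 0.2), (0.3, 0.9),
               (0.5, 0.9), (0.7, 0.4), (0.8, 0.4)]
demo_labels = [0, 1, 0, 0, 0, 1]
demo_links = [(0, 1), (4, 5)]

kept_points, kept_labels = prune_tomek_majority(demo_points, demo_labels, demo_links)
print(kept_points)
print(kept_labels)
```

The minority class items are never deleted, so the dataset ends up slightly less imbalanced. Removing one batch of items can create new Tomek links, so some people repeat the find-and-prune process, but that is a design choice rather than a requirement.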

As a general rule of thumb, you must be very cautious when pruning away items from imbalanced datasets. You must be just as cautious when generating synthetic minority class data, because the synthetic items might mask real majority class data.

A bit too much fun while drinking can lead to personal imbalance, which in turn can lead to beer-box masking. I know this from personal experience in college but the miracle of Internet image search provides concrete visual evidence too.
