A Quick Look at Isolation Forests for Anomaly Detection

I was reading a research paper about anomaly detection recently and it mentioned a technique called isolation forest. I decided to take a quick look at the technique. Briefly, starting with a set of data, an isolation tree repeatedly partitions the data at random into a tree structure. Anomalous items tend to be isolated after only a few partitions, so they end up close to the root node. Ordinary items end up farther from the root node. If you make many such random trees (a forest) and compute the average distance of an item from the roots, you get an anomaly score: the shorter the average distance, the more anomalous the item.
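To make the partitioning idea concrete, here is a minimal sketch of my own (not the scikit-learn implementation). It follows a single item down one random tree and counts how many random splits are needed to isolate it, then averages over many trees. The function names, the depth cap of 8, and the toy data with one far-away point are all assumptions made for illustration.

```python
import numpy as np

def path_length(item, data, rng, depth=0, max_depth=8):
    """Count the random splits needed to isolate `item` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    col = rng.integers(data.shape[1])            # pick a random column
    lo, hi = data[:, col].min(), data[:, col].max()
    if lo == hi:                                 # cannot split on this column
        return depth
    split = rng.uniform(lo, hi)                  # pick a random split value
    # Keep only the rows that fall on the same side of the split as the item.
    if item[col] < split:
        subset = data[data[:, col] < split]
    else:
        subset = data[data[:, col] >= split]
    return path_length(item, subset, rng, depth + 1, max_depth)

def mean_path_length(item, data, n_trees=200, seed=0):
    """Average isolation depth of `item` over many random trees."""
    rng = np.random.default_rng(seed)
    return float(np.mean([path_length(item, data, rng) for _ in range(n_trees)]))

# Toy data (assumed): 63 ordinary Gaussian items plus one far-away item.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(63, 2))
data = np.vstack([cluster, np.array([[8.0, 8.0]])])

outlier_depth = mean_path_length(data[-1], data)   # the far-away item
ordinary_depth = mean_path_length(data[0], data)   # an item inside the cluster
print("outlier: ", outlier_depth)
print("ordinary:", ordinary_depth)
```

The far-away item should come out with a clearly smaller average depth than the item inside the cluster. A real isolation forest also subsamples the data for each tree and normalizes the depth into a score, which this sketch leaves out.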

Demo output: in the first run, items [1] and [3] are flagged as anomalies. In the second run, items [1], [3] and [9] are flagged.

The scikit-learn library has an implementation of an isolation forest, so I coded up a quick demo. The key lines are:

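  import numpy as np
  from sklearn.ensemble import IsolationForest

  rng = np.random.RandomState(0)  # seed value is an assumption; the demo's seed isn't shown
  data = rng.randn(10,3)  # 10 items, 3 values each
  iso_for = IsolationForest(max_samples=10)
  model = iso_for.fit(data)
  predictions = model.predict(data)  # 1 = ordinary, -1 = anomaly
  scores = iso_for.score_samples(data)  # lower = more anomalous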

I generated 10 random Gaussian vectors (mean = 0.0, std = 1.0) of three values each. The fit() method creates a prediction model. The predict() method emits a 1 for ordinary data items and a -1 for anomalies. In the first run, the predictions were:

[ 1 -1  1 -1  1  1  1  1  1  1]

This means the isolation forest model flagged items [1] and [3] as anomalies. The predictions are based on an anomaly score. The corresponding scores were:

[-0.47 -0.59 -0.47 -0.59 -0.49 -0.45 -0.41 -0.43 -0.39 -0.49]

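With the default contamination setting ('auto' in recent scikit-learn versions), score values less than -0.50 are flagged as anomalies and score values of -0.50 or greater are not. The scores returned by score_samples() are always negative, and the lower the score, the more anomalous the item. The score values give you a way to compare how anomalous items are.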
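As far as I can tell, the threshold comes from the fitted model's offset_ attribute: decision_function() returns score_samples() minus offset_, and predict() flags an item when that shifted value is negative. The sketch below checks this relationship on data like the demo's. The seed values and variable names are my own assumptions, and the behavior described is for recent scikit-learn versions with the default contamination setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)             # assumed seed, not from the original demo
data = rng.randn(10, 3)

# random_state is fixed here only so repeated runs give the same output.
model = IsolationForest(max_samples=10, random_state=0).fit(data)

scores = model.score_samples(data)         # raw scores; lower = more anomalous
shifted = model.decision_function(data)    # raw scores minus model.offset_
predictions = model.predict(data)          # -1 where the shifted score is negative

print("offset_:", model.offset_)
print("flagged items:", np.flatnonzero(predictions == -1))
print("items ranked most to least anomalous:", np.argsort(scores))
```

Sorting by score with np.argsort() is one simple way to compare items when you care about a ranking rather than a hard anomaly / not-anomaly label.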

The isolation forest technique for anomaly detection has advantages and disadvantages compared to alternative techniques such as k-means clustering and autoencoder reconstruction error.

Isolation forest advantages include: it is fast and simple; it doesn’t use distance so it works, in theory, with both categorical and numeric data; and normalization is not needed. Disadvantages include: it has many hyperparameters that must be tuned; the randomness component means different runs can give different results (the second run of my demo gave slightly different results); it can wrongly flag ordinary items that lie close to anomalies (“swamping”); and it can miss anomalies that are clustered closely together and so conceal one another (“masking”).
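The run-to-run variation can be controlled with the random_state parameter of IsolationForest. The sketch below is an illustration under my own assumptions (the seed values and the fit_scores() helper are hypothetical, not part of the demo): fitting twice with the same random_state should reproduce the same scores, while a different random_state usually shifts them a little.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

data = np.random.RandomState(0).randn(10, 3)   # assumed seed for the data

def fit_scores(seed):
    """Fit a forest with the given random_state and return its scores."""
    forest = IsolationForest(max_samples=10, random_state=seed)
    return forest.fit(data).score_samples(data)

same_a = fit_scores(1)
same_b = fit_scores(1)     # same seed as same_a
other = fit_scores(2)      # different seed

print("same seed, identical scores:", np.allclose(same_a, same_b))
print("largest score difference across seeds:", np.abs(same_a - other).max())
```

Fixing the seed makes a demo repeatable, but it doesn't remove the underlying sensitivity: an item whose score sits near the threshold can still flip between ordinary and anomalous from one seed to the next.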

My initial impression is: I’m intrigued by the isolation forest technique for anomaly detection but need more information to form a solid opinion. Isolation forest anomaly detection could be useful as part of an ensemble approach that uses several anomaly detection techniques.

Left: “Alien Forest” by Ryan Gitter. Right: “Alien Forest” by Monica Langlois.
