I was exploring anomaly detection using isolation forests and got some strange results that make me skeptical of the technique. An isolation forest takes some data and then repeatedly constructs binary trees where the split at each branch is made on a randomly chosen column/feature at a random split value. The idea is that unusual values in a column will be split off quickly, and therefore data items with unusual values will end up close to the tree root. If you do this repeatedly and track the average depth of each item, items with small average depth are anomalous.
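To make that idea concrete, here is a minimal from-scratch sketch (my own toy code, not the scikit-learn implementation): repeatedly pick a random feature and a random split value, keep whichever side of the split contains the target item, and count how many splits it takes until the item is alone. An outlier should isolate in fewer splits on average.

```python
# Toy sketch of the isolation idea -- not the scikit-learn algorithm.
import numpy as np

rng = np.random.default_rng(0)

def isolation_depth(data, idx, max_depth=20):
  # number of random splits needed to separate data[idx] from the rest
  mask = np.ones(len(data), dtype=bool)  # items still grouped with idx
  depth = 0
  while mask.sum() > 1 and depth < max_depth:
    col = rng.integers(data.shape[1])    # random feature
    vals = data[mask][:, col]
    lo, hi = vals.min(), vals.max()
    if lo == hi:                         # feature constant; no useful split
      depth += 1
      continue
    split = rng.uniform(lo, hi)          # random split value
    side = data[:, col] < split
    keep = side if side[idx] else ~side  # keep the side containing idx
    mask &= keep
    depth += 1
  return depth

data = np.array([[1.0], [1.1], [0.9], [1.05], [9.0]])  # 9.0 is an outlier
depths = [np.mean([isolation_depth(data, i) for _ in range(200)])
          for i in range(len(data))]
# the outlier (index 4) should have the smallest average depth
```

Averaging over 200 random trees, the outlier at 9.0 usually isolates in about one split, while the clustered items need several.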
I wondered how this algorithm could deal with a situation where a data item has two features, each feature value is common, but the combination of the two is very unusual. For example, suppose you have a bunch of people data with sex, age, income. The 20-29 year-olds all have incomes in the $30,000s. The 50-59 year-olds all have incomes in the $60,000s. But if you see a 20-year-old with an income of $62,000, that person should definitely be flagged as an extreme anomaly.
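A quick sketch of why this is hard for single-feature splits (my own illustrative numbers, echoing the dummy data): each feature value of the suspicious item sits inside the normal per-feature range, so no random split on one feature isolates it quickly.

```python
import numpy as np

# Illustrative ages and incomes (hypothetical, patterned on my dummy data).
ages    = np.array([0.20, 0.22, 0.25, 0.29, 0.50, 0.55, 0.59])
incomes = np.array([39.0, 62.0, 34.0, 30.0, 69.0, 64.0, 60.0])

# Age 0.22 sits inside the observed age range ...
in_age_range    = ages.min() <= 0.22 <= ages.max()
# ... and income 62 sits inside the observed income range ...
in_income_range = incomes.min() <= 62.0 <= incomes.max()
# ... so a random split on either single feature is unlikely to
# separate (0.22, 62) quickly; only the combination is unusual.
```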
I created such a 20-item dummy data file and ran it through the isolation forest implementation in the scikit-learn code library. With all normal data, weirdly, the isolation forest flagged four items as normal (+1) and the other 16 items as anomalies (-1). Hmm. Strange.
Then I changed one data item from (0, 0.22, 37), meaning a male, 22-year-old who makes $37,000, to (0, 0.22, 62), meaning the same person now makes $62,000, which is far more than any other person in their 20s. With the modified data, the isolation forest got quite different results that didn't make much sense. Although the high-earning 22-year-old was flagged as anomalous, it wasn't flagged as being any more anomalous than a 20-year-old who makes $39,000 or the 59-year-old who makes $60,000.
It looks as though the isolation forest is only looking at extreme values in individual features, and isn’t finding anomalies that result from interactions between features. For sure, there are many isolation forest parameters that I didn’t explore, and my dummy dataset was tiny, but still, the isolation forest anomaly detection results were strange.
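One of those unexplored parameters is contamination, which tells scikit-learn's IsolationForest what fraction of the data to label as anomalous (-1), rather than using the default "auto" threshold. A minimal sketch with synthetic data and my own parameter choices, not a full study:

```python
# Sketch: the contamination parameter controls how many items get -1.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
X = rng.normal(size=(100, 2))  # 100 ordinary 2D points
X[0] = [8.0, 8.0]              # one obvious outlier

iso = IsolationForest(n_estimators=200, contamination=0.01,
                      random_state=1).fit(X)
labels = iso.predict(X)        # -1 = anomaly, +1 = normal
# with contamination=0.01, roughly 1% of items are labeled -1,
# and the planted outlier should be among them
```

With the default contamination="auto", the -1/+1 threshold is derived from the original isolation forest paper's scoring, which may explain some of the odd labeling on my tiny 20-item dataset.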
Left: “Mystery Men” (1999) featured a collection of third-rate superheroes, including The Shoveler, who is very competent at shoveling dirt; Mr. Furious, who has the ability to get very angry; and The Blue Raja, who can throw forks and spoons with great accuracy. Very funny movie.
Center: “Supervized” (2019) was set in a retirement home for superheroes. The retirees included Pendle, Shimmy, and Ray. An OK film but not as good as the other two I show here.
Right: “Superhero Movie” (2008) mostly spoofs Spider-Man but also has references to X-Men, the Fantastic Four, and other movies. An uneven movie but some parts of it are hilarious.
Here’s the demo code with embedded data:
# iforest_weak_demo.py

import numpy as np
from sklearn.ensemble import IsolationForest

def main():
  print("\nBegin scikit Isolation Forest demo ")
  np.random.seed(1)

  data = np.array([
    [0,0.20,39],
    [1,0.21,38],
    # [0,0.22,37],
    [0,0.22,62],
    [1,0.23,36],
    [0,0.24,35],
    [1,0.25,34],
    [0,0.26,33],
    [1,0.27,32],
    [0,0.28,31],
    [1,0.29,30],
    [0,0.50,69],
    [1,0.51,68],
    [0,0.52,67],
    [1,0.53,66],
    [0,0.54,65],
    [1,0.55,64],
    [0,0.56,63],
    [1,0.57,62],
    [0,0.58,61],
    [1,0.59,60]], dtype=np.float32)

  np.set_printoptions(precision=4, suppress=True, linewidth=100)
  print("\nData: ")
  print(data)

  iso_for = IsolationForest()
  model = iso_for.fit(data)

  print("\nPredictions: ")
  predictions = model.predict(data)
  print(predictions)

  np.set_printoptions(precision=3, suppress=True, linewidth=40)
  print("\nAnomaly scores: ")
  scores = iso_for.score_samples(data)
  print(scores)

  print("\nEnd demo ")

if __name__ == "__main__":
  main()