Help! IndexError when trying to explain an IsolationForest.
I am using scikit-learn's IsolationForest for anomaly detection. Usually, the datasets I use have more than one variable, but sometimes they have only one. Fitting the model and predicting with it work fine in that case. However, when I try to explain the model's output using shap's TreeExplainer, I get an IndexError.
See below for a minimal reproducible example:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from shap import TreeExplainer
# Univariate dataset: a single column of random integers
df = pd.DataFrame({'Column1': np.random.randint(0, 100, 100)})
model = IsolationForest()
model.fit(X=df)  # fitting works
explainer = TreeExplainer(model)  # raises IndexError
The root cause of the problem seems to be the following (see code below): each IsolationForest has multiple isolation trees. In the TreeExplainer, multiple IsoTree objects are initialised. During initialisation, the line quoted below crashes, because self.features contains -2, which is out of bounds as an index since tree_features is just a one-element array ([0]). So maybe the problem is that when fitting the IsolationForest, the wrong values are given for self.features.
# re-number the features if each tree gets a different set of features
self.features = np.where(self.features >= 0, tree_features[self.features], self.features)
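The crash can be reproduced with plain NumPy, without shap or scikit-learn. The following is a minimal sketch under assumptions: the values in `features` are made up, with -2 standing for leaf nodes (the marker scikit-learn uses in `tree_.feature`), and `renumber` and `renumber_safe` are hypothetical helpers, not shap's actual code. It suggests that the -2 entries may be expected, and that the failure comes from `np.where` evaluating `tree_features[features]` for every element before it selects anything.

```python
import numpy as np

# Assumed values: one split node on feature 0 and two leaf nodes marked -2
features = np.array([0, -2, -2])


def renumber(features, tree_features):
    # Same expression as the quoted line. tree_features[features] is computed
    # for all entries, including the -2 ones, before np.where picks a branch.
    return np.where(features >= 0, tree_features[features], features)


# Two features: -2 is a valid negative index, so the lookup succeeds.
print(renumber(features, np.array([0, 1])))

# One feature: -2 is out of bounds for a length-1 array.
try:
    renumber(features, np.array([0]))
except IndexError as err:
    print("IndexError:", err)


def renumber_safe(features, tree_features):
    # Hypothetical guard: index tree_features only with the non-negative entries
    out = features.copy()
    mask = features >= 0
    out[mask] = tree_features[features[mask]]
    return out


print(renumber_safe(features, np.array([0])))
```

If this reading is right, it would also explain why the error only shows up with a single column: with two or more features, -2 silently resolves as a negative index and the result is discarded by `np.where`.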
Any idea how to fix this?
Of course, for a univariate model, using Shapley values is pointless since you could just use the anomaly scores from score_samples. I plan to use this as a workaround, but surely there's a more elegant way where this would not be required?
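For reference, the workaround would look roughly like this. It is a minimal sketch that reuses the single-column setup from the example above; the fixed seed and `random_state` are assumptions added only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Univariate dataset, as in the reproducible example (seed is an assumption)
rng = np.random.default_rng(0)
df = pd.DataFrame({"Column1": rng.integers(0, 100, 100)})

model = IsolationForest(random_state=0).fit(df)

# With a single feature there is nothing to attribute across columns,
# so the per-sample anomaly score is used directly instead of SHAP values.
scores = model.score_samples(df)  # one score per row; lower means more abnormal
print(scores[:5])
```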
Thanks and best wishes,
Alexander