IndexError for Shap TreeExplainer with Univariate IsolationForest


Help! IndexError when trying to explain an IsolationForest.

I am using scikit-learn's IsolationForest for anomaly detection. The datasets I use usually have more than one variable, but sometimes they have only one. Fitting the model and predicting with it work fine in that case. However, when I try to explain the model's output with shap's TreeExplainer, I get an IndexError.

See below for a minimal reproducible example:

import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from shap import TreeExplainer

df = pd.DataFrame()
df['Column1'] = np.random.randint(0, 100, 100)
model = IsolationForest()
model.fit(X=df)
explainer = TreeExplainer(model)

The root cause of the problem seems to be the following (see the code below). Each IsolationForest consists of multiple isolation trees, and in the TreeExplainer multiple IsoTree objects are initialised. During initialisation, the line below crashes: self.features, an array, contains -2 (the value scikit-learn uses to mark leaf nodes), and using -2 as an index is out of bounds because tree_features is just array([0]). So maybe the problem is that fitting the IsolationForest produces the wrong values for self.features.

# re-number the features if each tree gets a different set of features
self.features = np.where(self.features >= 0, tree_features[self.features], self.features)
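The failure can be reproduced with NumPy alone. The sketch below is not shap's code: the `features` array and both helper functions are made up for illustration. It shows that `np.where` evaluates both branches eagerly, so `tree_features[features]` is indexed with -2 too. With two or more features, -2 happens to be a valid negative index and nothing fails; with a single feature it is out of bounds. A mask-based variant, shown only as one possible way around it, indexes with the non-negative entries alone.

```python
import numpy as np

# Hypothetical node-feature array for one small tree:
# 0 = split on feature 0, -2 = leaf marker (no feature).
features = np.array([0, -2, -2])


def renumber_eager(features, tree_features):
    # Same shape as the line above: both np.where branches are evaluated
    # eagerly, so tree_features is indexed with -2 as well.
    return np.where(features >= 0, tree_features[features], features)


def renumber_masked(features, tree_features):
    # Index only with the non-negative entries; leaf markers stay untouched.
    out = features.copy()
    mask = features >= 0
    out[mask] = tree_features[features[mask]]
    return out


# Two features: -2 is a valid negative index, so nothing fails.
two = renumber_eager(features, np.array([0, 1]))

# One feature: -2 is out of bounds for an array of length 1.
try:
    renumber_eager(features, np.array([0]))
    eager_failed = False
except IndexError:
    eager_failed = True

# The masked variant handles the single-feature case.
one = renumber_masked(features, np.array([0]))
```

If this reading is right, the values in self.features are not wrong as such; the multivariate case only works because negative indices wrap around.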

Any idea how to fix this?

Of course, for a univariate model, using Shapley values is pointless since you could just use the anomaly scores from score_samples. I plan to use this as a workaround, but surely there's a more elegant way where this would not be required?
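For reference, a minimal sketch of that workaround, assuming the same single-column setup as in the example above (the fixed seeds are only there to make it repeatable). It skips shap entirely and reads the per-sample scores from the fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Univariate data, as in the reproducible example above (seeded here).
rng = np.random.default_rng(0)
df = pd.DataFrame({"Column1": rng.integers(0, 100, 100)})

model = IsolationForest(random_state=0)
model.fit(df)

# One score per row; lower values indicate more anomalous samples.
scores = model.score_samples(df)

# decision_function is the same score shifted by the fitted offset.
shifted = model.decision_function(df)
```

With only one feature, these scores already carry all the information a per-feature attribution could give.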

Thanks and best wishes,

Alexander
