How to plot misclassified samples in shap?


I have a dataset of genes scored between 0 and 1 for their likelihood of causing disease (genes scored 1 are known to cause disease, and a gene scored at, say, 0.74 is likely to cause disease). I am trying to build a machine learning model to predict that disease score for new genes in regression classification.

I want to look at a SHAP decision plot for the genes which are known disease genes but are being scored low by the model (e.g. genes scored 1 that my model scores below 0.8). I am struggling to select and group these genes together for plotting.

My data looks like:

X:
Index   Feature1  Feature2   ... FeatureN
Gene1     1           0.2          10
Gene2     1           0.1          7
Gene3     0           0.3          10
#index is actually the index and not a column

Y:
Score
1
0.6
0.4

I run an XGBoost regressor with nested cross-validation, look at the MSE and predicted r², and plot the observed vs expected values. I can see in the observed vs expected plot that many of the genes scored 1 in Y are given low predictions by the model, and I want to understand why the model is doing this using SHAP. I cannot give example data unfortunately.
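Roughly, the evaluation I run looks like this (a simplified sketch; the fold counts and parameter grid here are placeholders):

from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score
import xgboost

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# inner loop tunes hyperparameters, outer loop gives out-of-fold predictions
model = GridSearchCV(xgboost.XGBRegressor(),
                     param_grid={'max_depth': [3, 5, 7]},   # placeholder grid
                     cv=inner_cv, scoring='neg_mean_squared_error')

y_cv_pred = cross_val_predict(model, X, Y, cv=outer_cv)
print(mean_squared_error(Y, y_cv_pred), r2_score(Y, y_cv_pred))
# observed vs expected plot: scatter Y against y_cv_pred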

I am trying to adapt example shap code given for label classification:

import shap
import xgboost

xgbr = xgboost.XGBRegressor()
xgbr.fit(X_train, Y_train)

select = range(8)  # I have 8 features after feature selection with BorutaShap
features = X.iloc[select]           # note: .iloc here selects the first 8 rows (samples), not columns
features_display = X.loc[features.index]

explainer = shap.TreeExplainer(xgbr)
expected_value = explainer.expected_value
shap_values = explainer.shap_values(features)   # SHAP values for the selected samples

#Example code from https://slundberg.github.io/shap/notebooks/plots/decision_plot.html: 

y_pred = xgbr.predict(X) 
y_pred = (shap_values.sum(1) + expected_value) > 0
misclassified = y_pred != y_test[select]
shap.decision_plot(expected_value, shap_values, features_display, link='logit', highlight=misclassified)

How do I select y_pred so that it picks out the predictions/genes that should be 1 but were predicted lower than 0.8 (or any other cut-off)?

Edit: in response to the answer given, I have tried:

explainer = shap.TreeExplainer(xgbr)
shap_values = explainer.shap_values(X_test)

y_pred = xgbr.predict(X_test)
m = (y_pred <= 0.5) & (Y_test == 1)

shap.initjs()
shap.decision_plot(explainer.expected_value, shap_values, X_test[m], return_objects=True)

This runs, but m has a length of 171 (the total number of rows in my Y_test data) and the plot then appears to plot all 171 samples; from looking at the data I know there should be only one gene that is predicted <= 0.5 but is actually scored 1.
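For reference, the mask itself will always have the same length as Y_test; the number of genes it actually selects can be checked with something like:

print(m.sum())          # number of genes with a true score of 1 but a prediction <= 0.5
print(X_test[m].shape)  # rows actually selected by the mask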


There are 2 answers

Accepted answer by yatu:

First of all, you mention predict that disease score for new genes in regression classification, what do you mean? The output seems to be binary, 0 or 1, hence this looks like a binary classification problem, and you could be using xgboost's classifier instead. Update: as per the comments, let's assume a regression problem to simulate your case. (For the example below we keep xgboost's default regression objective, so the output is a continuous score; if you wanted the actual class labels instead, you would set 'objective': 'multi:softmax'.)
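For completeness, if you did go the classifier route, a minimal sketch, assuming the targets really were binary labels and using the variable names from your question, would be:

import xgboost

xgbc = xgboost.XGBClassifier()
xgbc.fit(X_train, Y_train)
proba = xgbc.predict_proba(X_test)[:, 1]   # predicted probability of being a disease gene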

As per your question, it seems that what you're trying to do is to index the test set on those samples that were not correctly predicted, and analyse the misleading features, which makes a reasonable amount of sense.

Let's reproduce your problem with some sample dataset:

from sklearn.model_selection import train_test_split
import shap
import xgboost

X, y = shap.datasets.iris()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = xgboost.train(params={"learning_rate": 0.01},
                      dtrain=xgboost.DMatrix(X_train, label=y_train),
                      num_boost_round=100)

A SHAP plot using the entire test set is straightforward. For a force_plot, for instance:

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

shap.initjs()
shap.force_plot(explainer.expected_value, shap_values, X_test)

[force plot over the full test set]

Now, if we want to do the same with the misclassified samples, we need to look at the model's output values. Since the iris dataset has several classes, let's say we want to visualize the force_plot for those samples that should have been classified as 2, but instead have an output value below 1.7:

y_pred = model.predict(xgboost.DMatrix(X_test))
m = (y_pred <= 1.7) & (y_test == 2)

Now let's use the mask to perform boolean indexing on X_test, and update the shap_values:

shap.initjs()
shap_values = explainer.shap_values(X_test[m])   # recompute SHAP values for the masked samples only
shap.force_plot(explainer.expected_value, shap_values, X_test[m])

[force plot for the misclassified samples]

This tells us that petal length and petal width are what is mostly pushing the regression towards a higher value, so presumably they are the variables playing the major part in the misclassification.

Similarly, for a decision_plot:

shap.decision_plot(explainer.expected_value, shap_values, 
                   X_test[m], feature_order='hclust', 
                   return_objects=True)

[decision plot for the misclassified samples]
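Translating this back to your regression setting, a sketch using the variable names from your question (the 0.8 cut-off is just an example) would be:

y_pred = xgbr.predict(X_test)
m = (Y_test == 1) & (y_pred < 0.8)                # known disease genes the model under-scores

shap_values_m = explainer.shap_values(X_test[m])  # SHAP values for the masked rows only
shap.decision_plot(explainer.expected_value, shap_values_m, X_test[m])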

Answer by igrinis:

Since I don't have your dataset, I can't check the code, but here are some thoughts that may point you in the right direction.

It seems that you do not train your regressor. It should be a line like

xgbr = xgboost.XGBRegressor()
xgbr.fit(X, Y)

Now you can use xgbr.predict(X) ;)

You also need to run the explainer to get the SHAP values:

import warnings

explainer = shap.TreeExplainer(xgbr)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    sh = explainer.shap_values(X)

And now you can select values:

misclassified = (y_pred <= 0.7) & (Y == 1)
shap.decision_plot(expected_value, sh, X, link='logit', highlight=misclassified)

Before you use shap, I suggest you check how well your regressor fits your data. For this, hold out part of the data as a test set that is not used in training, then evaluate the goodness of fit by calculating and comparing the MSE on the test and training sets. The larger the difference, the worse your predictor performs.
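A rough sketch of that check, reusing the X and Y from your question:

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import xgboost

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

xgbr = xgboost.XGBRegressor()
xgbr.fit(X_train, Y_train)

mse_train = mean_squared_error(Y_train, xgbr.predict(X_train))
mse_test = mean_squared_error(Y_test, xgbr.predict(X_test))
print(mse_train, mse_test)   # a large gap between the two suggests overfitting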