I have a dataset of genes scored between 0 and 1 for the likelihood of causing disease (genes scored 1 are known to cause disease, and a gene scored at, say, 0.74 probably causes disease). I am trying to build a machine learning model to predict that disease score for new genes in regression classification.
I want to look at a SHAP decision plot for the genes that are known disease genes but are being scored low (e.g. genes scored 1 for which my model predicts less than 0.8). I am struggling to group these genes together to plot.
My data looks like:
X:
Index Feature1 Feature2 ... FeatureN
Gene1 1 0.2 10
Gene2 1 0.1 7
Gene3 0 0.3 10
#index is actually the index and not a column
Y:
Score
1
0.6
0.4
I run an XGBoost regressor with nested cross-validation and look at the MSE, predicted r2, and plot the observed vs. expected values. I can see in the observed vs. expected plot that many of the genes scored 1 in Y receive low predicted scores from the model, and I want to understand why the model is doing this using SHAP. I cannot give example data, unfortunately.
I am trying to adapt example SHAP code given for label classification:
import shap
import xgboost

xgbr = xgboost.XGBRegressor()
xgbr.fit(X_train, Y_train)

select = range(8)  # I have 8 features after feature selection with BorutaShap
features = X.iloc[select]  # note: iloc with `select` picks the first 8 rows, not columns
features_display = X.loc[features.index]

explainer = shap.TreeExplainer(xgbr)
expected_value = explainer.expected_value
shap_values = explainer.shap_values(features)  # was missing before being used below

# Example code from https://slundberg.github.io/shap/notebooks/plots/decision_plot.html:
y_pred = xgbr.predict(X)
y_pred = (shap_values.sum(1) + expected_value) > 0  # thresholding like this only makes sense for classification
misclassified = y_pred != Y_test[select]
shap.decision_plot(expected_value, shap_values, features_display, link='logit', highlight=misclassified)
How do I select `y_pred` so that it contains the predictions/genes that were meant to be 1 but were actually scored lower than 0.8 (or any low number)?
edit: in response to the given answer I have tried:
explainer = shap.TreeExplainer(xgbr)
shap_values = explainer.shap_values(X_test)
y_pred = xgbr.predict(X_test)
m = (y_pred <= 0.5) & (Y_test == 1)
shap.initjs()
shap.decision_plot(explainer.expected_value, shap_values, X_test[m], return_objects=True)
This runs, but `m` has a length of 171 (the entire number of rows in my Y_test data) and the plot then appears to plot all 171, yet from looking at the data I know there should be only one gene that is predicted <= 0.5 but is actually scored 1.
First of all, you mention "predict that disease score for new genes in regression classification": what do you mean? The output seems to be binary, `0` or `1`, hence this is a binary classification problem. You should be using `xgboost`'s classifier instead.

Update: let's assume a regression problem though, as per the comments, to simulate your case. (For a classifier example we would instead set `'objective':'multi:softmax'` to output the actual labels.)

As per your question, it seems that what you're trying to do is to index the test set on those samples that were not correctly predicted, and analyse the misleading features, which makes a reasonable amount of sense.
Let's reproduce your problem with some sample dataset:
A SHAP plot using the entire test set is straightforward: for a `force_plot`, for instance, you simply pass the full `shap_values` and `X_test`.

Now, if we want to do the same with the misclassified samples, we need to look at the output values. Since the iris dataset has several classes, let's say we want to visualize the `force_plot` for those samples that should have been classified as `2`, but for which we instead have an output value below `1.7`.

Now let's use the mask to perform boolean indexing on the `X_test` set, and update the `shap_values`
This is telling us that it is the petal length and width that are mostly pushing the regression to a higher value. Hence they presumably are the variables that are playing a major part in the misclassification.
Similarly, for a `decision_plot`: