I have a dataset of genes scored between 0 and 1 for the likelihood of causing disease (genes scored 1 are known to cause disease, and a gene scored at, say, 0.74 probably causes disease). I am trying to build a machine learning model to predict that disease score for new genes in regression classification.
I want to look at a SHAP decision plot for the genes that are known disease genes but are being scored low (e.g. genes scored 1 for which my model predicts less than 0.8). I am struggling to group these genes together to plot.
My data looks like:
X:
Index Feature1 Feature2 ... FeatureN
Gene1 1 0.2 10
Gene2 1 0.1 7
Gene3 0 0.3 10
#index is actually the index and not a column
Y:
Score
1
0.6
0.4
I run an XGBoost regressor with nested cross-validation and look at the MSE, predicted r2, and plot the observed vs. expected values. I can see in the observed vs. expected plot that many of the genes scored 1 in Y receive low predicted scores from the model, and I want to understand why the model is doing this using SHAP. I cannot give example data, unfortunately.
I am trying to adapt example SHAP code given for label classification:
import shap
import xgboost

xgbr = xgboost.XGBRegressor()
xgbr.fit(X_train, Y_train)

select = range(8)  # I have 8 features after feature selection with BorutaShap
features = X.iloc[select]  # note: iloc with `select` picks the first 8 rows, not columns
features_display = X.loc[features.index]

explainer = shap.TreeExplainer(xgbr)
expected_value = explainer.expected_value
shap_values = explainer.shap_values(features)  # was missing before being used below

# Example code from https://slundberg.github.io/shap/notebooks/plots/decision_plot.html:
y_pred = xgbr.predict(X)
y_pred = (shap_values.sum(1) + expected_value) > 0  # thresholding like this only makes sense for classification
misclassified = y_pred != Y_test[select]
shap.decision_plot(expected_value, shap_values, features_display, link='logit', highlight=misclassified)
How do I select `y_pred` so that it contains the predictions/genes that were meant to be 1 but were actually scored lower than 0.8 (or any low number)?
edit: in response to the given answer I have tried:
explainer = shap.TreeExplainer(xgbr)
shap_values = explainer.shap_values(X_test)
y_pred = xgbr.predict(X_test)
m = (y_pred <= 0.5) & (Y_test == 1)
shap.initjs()
shap.decision_plot(explainer.expected_value, shap_values, X_test[m], return_objects=True)
This runs, but `m` has a length of 171 (the entire number of rows in my Y_test data) and the plot then appears to plot all 171, yet from looking at the data I know there should be only one gene that is predicted <= 0.5 but is actually scored 1.
First of all, you mention "predict that disease score for new genes in regression classification": what do you mean? The output seems to be binary, `0` or `1`, hence this is a binary classification problem. You should be using `xgboost`'s classifier instead.

Update: let's assume a regression problem though, as per the comments, to simulate your case. (For a classifier example we would instead set `'objective':'multi:softmax'` to output the actual labels.)

As per your question, it seems that what you're trying to do is to index the test set on those samples that were not correctly predicted, and analyse the misleading features, which makes a reasonable amount of sense.
Let's reproduce your problem with some sample dataset:
A SHAP plot using the entire test set is straightforward: for a `force_plot`, for instance, you simply pass the full `shap_values` and `X_test`.

Now, if we want to do the same with the misclassified samples, we need to look at the output values. Since the iris dataset has several classes, let's say we want to visualize the `force_plot` for those samples that should have been classified as `2`, but for which we instead have an output value below `1.7`.

Now let's use the mask to perform boolean indexing on the `X_test` set, and update the `shap_values`
This is telling us that it is the petal length and width that are mostly pushing the regression to a higher value. Hence they presumably are the variables that are playing a major part in the misclassification.
Similarly, for a `decision_plot`: