I have a genetic dataset in which each row's index is the name of a gene. I also want to find the row number of any given gene, so that I can inspect genes individually after they have gone through a machine learning model's prediction and interpret each gene's prediction with SHAP. My current SHAP plotting code needs a row number to pull out a specific gene.
My data looks like this:
Index   Feature1   Feature2   ...   FeatureN
Gene1   1          0.2        ...   10
Gene2   1          0.1        ...   7
Gene3   0          0.3        ...   10
For example, if I want to pull out and view the model prediction for Gene3, I do this:
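import shap
import xgboost

shap.initjs()
xgbr = xgboost.XGBRegressor()
# (model fitting on the training data is not shown here)

def shap_plot(j):
    # Build a tree explainer for the model and compute SHAP values for every training row
    explainerModel = shap.TreeExplainer(xgbr)
    shap_values_Model = explainerModel.shap_values(X_train)
    # Force plot for the row at integer position j
    p = shap.force_plot(explainerModel.expected_value, shap_values_Model[j], X_train.iloc[[j]], feature_names=df.columns)
    return p

shap_plot(3)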
Calling shap_plot(3) is a problem for me because I do not actually know whether the gene I want is in row 3 of the shuffled training or testing data.
Is there a way to get the row number from a known gene index? Or could I re-code my SHAP plot so that it accepts my string indices? I have a biology background, so any guidance would be appreciated.
Try the following. Here df is your DataFrame, and result gives the row number for a given gene (the first row gives 1, and so on).
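A minimal sketch of one way to do this lookup, assuming the gene names are unique in the index. It uses pandas' Index.get_loc, which returns the 0-based position of a label. The helper name gene_position and the small example frames are made up for illustration. Note that .iloc and shap_values_Model[j] expect the 0-based position, so add 1 only if you want a 1-based row number.

```python
import pandas as pd

# Hypothetical stand-in for the data in the question; gene names form the index.
df = pd.DataFrame(
    {"Feature1": [1, 1, 0], "Feature2": [0.2, 0.1, 0.3], "FeatureN": [10, 7, 10]},
    index=["Gene1", "Gene2", "Gene3"],
)

def gene_position(frame, gene):
    """Return the 0-based row position of `gene` in `frame`.

    Assumes the index labels are unique; with duplicate labels,
    get_loc returns a slice or boolean mask instead of an integer.
    """
    return frame.index.get_loc(gene)

position = gene_position(df, "Gene3")  # 0-based position: 2
result = position + 1                  # 1-based row number: 3

# After shuffling, look the gene up in the shuffled frame itself,
# because its position there differs from its position in df.
shuffled = df.iloc[[2, 0, 1]]          # fixed reordering, for illustration only
shuffled_position = gene_position(shuffled, "Gene3")  # 0
```

With the function from the question, this would be used as shap_plot(gene_position(X_train, "Gene3")), provided X_train still carries the gene names as its index and the gene ended up in the training split rather than the test split.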