How to find the row number from a character index in Python?

I have a genetic dataset where the index of each row is the name of a gene. I would also like to find the row number of any given gene, so that I can look at genes individually after they have gone through a machine learning model prediction and interpret each gene's prediction with shap. My current shap plotting code needs a row number to pull out a specific gene.

My data looks like this:

Index   Feature1  Feature2   ... FeatureN
Gene1     1           0.2          10
Gene2     1           0.1          7
Gene3     0           0.3          10

For example, if I want to pull out and view the model prediction for Gene3, I do this:

import shap
import xgboost

shap.initjs()

# The model is assumed to have been fitted on X_train before plotting.
xgbr = xgboost.XGBRegressor()

def shap_plot(j):
    # j is a 0-based row position in X_train
    explainerModel = shap.TreeExplainer(xgbr)
    shap_values_Model = explainerModel.shap_values(X_train)
    p = shap.force_plot(explainerModel.expected_value, shap_values_Model[j], X_train.iloc[[j]], feature_names=df.columns)
    return p

shap_plot(3)

Doing shap_plot(3) is a problem for me as I do not actually know if the gene I want is in row 3 in the shuffled training or testing data.

Is there a way to pull out the row number from a known Gene index? Or potentially re-code my shap plot so it does accept my string indices? I have a biology background so any guidance would be appreciated.

There are 3 answers

IoaTzimas On BEST ANSWER

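Try the following, where df is your dataframe. The result is the 1-based row number for a given gene (the first row gives 1, and so on). Note that .iloc counts from 0, so drop the +1 if you pass the result to shap_plot, and run the lookup on the same dataframe you plot from (X_train in the question), since shuffling changes the row order.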

list(df.index).index('Gene3')+1

#result

3
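As an alternative to converting the index to a list, pandas indexes have a get_loc method that returns the position of a label directly. The following is a minimal sketch on toy data modelled on the table in the question; the values, the random_state and the use of sample to imitate a shuffled training split are assumptions for illustration only.

```python
import pandas as pd

# Toy data modelled on the question: gene names are the row index.
df = pd.DataFrame(
    {"Feature1": [1, 1, 0], "Feature2": [0.2, 0.1, 0.3]},
    index=["Gene1", "Gene2", "Gene3"],
)

# Imitate a shuffled training split: the gene names travel with their rows.
X_train = df.sample(frac=1, random_state=0)

# 0-based position of the gene in the shuffled frame (assumes unique gene names).
position = X_train.index.get_loc("Gene3")

# The position can be used with .iloc, which is also 0-based.
row = X_train.iloc[[position]]
```

Because the lookup is done on X_train itself, the position stays correct whatever order the rows were shuffled into.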
Dan On

There are a lot of ways to get the row number associated with either an index value or a column value.

If your genes are actually in a column called "Index" for example, you can do this:

x_train[x_train["Index"] == "Gene3"].index + 1

and if not, you can always get to that by calling reset_index() on your dataframe.
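The reset_index route can be sketched as follows. The toy values and the index name "Index" are assumptions taken from the table in the question, and the sketch keeps the position 0-based so that it can be used with .iloc.

```python
import pandas as pd

# Toy data modelled on the question, with the gene names as a named index.
x_train = pd.DataFrame(
    {"Feature1": [1, 1, 0], "Feature2": [0.2, 0.1, 0.3]},
    index=pd.Index(["Gene1", "Gene2", "Gene3"], name="Index"),
)

# Move the gene names into an ordinary column called "Index".
# The new default index is 0, 1, 2, ..., so it doubles as the row position.
flat = x_train.reset_index()

# Filtering by gene name returns an Index of matching positions, not a scalar.
matches = flat[flat["Index"] == "Gene3"].index

# Take the first (and, for unique gene names, only) match as a plain integer.
position = int(matches[0])
```

The filter returns an Index object even when only one row matches, which is why the sketch takes the first element before using it as a row number.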

Another option is to just make a new column on your dataframe that goes from 0 to n - 1, for example

mapping = x_train.assign(index_number=range(x_train.shape[0]))["index_number"]

Now mapping should look like this:

Index   index_number
Gene1     0
Gene2     1
Gene3     2

and calling mapping["Gene2"] should return 1.

In addition to this, I notice you're using force plots. I recommend you read this article on why shap has replaced them with decision plots.

Also, you are rebuilding the tree explainer every time you call your function. This is very inefficient. Why not build it once and then query it many times:

import shap

class ShapPlotter:
    def __init__(self, model, x_train):
        # Build the explainer and the SHAP values once, then reuse them for every plot.
        self.explainer_model = shap.TreeExplainer(model)
        self.shap_values_Model = self.explainer_model.shap_values(x_train)
        # Series that maps each gene name (the index) to its 0-based row position.
        self.gene_index_mapping = x_train.assign(index_value=range(x_train.shape[0]))["index_value"]

    def plot(self, gene):
        idx = self._get_index(gene)
        shap_plot = shap.force_plot(...) # replace j with idx here
        return shap_plot

    def _get_index(self, gene: str) -> int:
        # your choice of method here. e.g. https://stackoverflow.com/a/64279019/1011724
        # in this case, I built a mapping series in the __init__ fn so you can get the index number by just indexing directly with the gene string:
        return self.gene_index_mapping.loc[gene]
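For readers who prefer plain functions to a class, here is a hedged sketch of the same idea: look the gene up by name, then reuse the force_plot call from the question. The names row_position and force_plot_for_gene are hypothetical, the lookup uses the pandas get_loc method and assumes gene names are unique, and the explainer and SHAP values are assumed to have been built once beforehand.

```python
import pandas as pd


def row_position(X: pd.DataFrame, gene: str) -> int:
    """Return the 0-based row position of `gene` in X's index."""
    # With duplicate labels get_loc returns a slice or mask, so reject that case.
    if not X.index.is_unique:
        raise ValueError("gene names in the index must be unique")
    return int(X.index.get_loc(gene))


def force_plot_for_gene(explainer, shap_values, X: pd.DataFrame, gene: str):
    """Force plot for one gene, selected by name instead of row number."""
    import shap  # imported here so row_position can be used without shap installed

    j = row_position(X, gene)
    # Same call as in the question, with j derived from the gene name.
    return shap.force_plot(
        explainer.expected_value,
        shap_values[j],
        X.iloc[[j]],
        feature_names=list(X.columns),
    )
```

In use, explainer = shap.TreeExplainer(xgbr) and shap_values = explainer.shap_values(X_train) would be computed once, after which force_plot_for_gene(explainer, shap_values, X_train, "Gene3") can be called for as many genes as needed.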
wwnde On
list(df[df.Index=='Gene3'].index)