I am trying to understand how to calculate the expected time for each of the ids in my dataset. I have a DataFrame of shape (500, 4) that looks like this:
ids  var1      var2  churn  time
0    1.738434   324      0  21.0
1    1.541176    12      0   4.0
2    2.049281   753      1   5.0
3    1.929860   563      0  16.0
4    1.595027    22      0   5.0
...  ...        ...    ...   ...
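For reproducibility, here is a minimal sketch of how a stand-in frame of this shape could be generated (the distributions are made up, not my real data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# synthetic stand-in: two covariates, an event flag, and a duration
data = pd.DataFrame({
    "var1": rng.normal(1.8, 0.2, n),
    "var2": rng.integers(1, 1000, n),
    "churn": rng.integers(0, 2, n),           # 1 = event observed, 0 = censored
    "time": np.ceil(rng.exponential(10, n)),  # duration until event/censoring
})
data.index.name = "ids"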
Let's take lifelines and calculate the expected value, using either predict_expectation or the median of the survival function for each ID.
Part 1: Calculate the expected values
from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(data, duration_col="time", event_col="churn")

censored_df = data[data["churn"] == 0]

cph.predict_expectation(censored_df)  # optionally: conditional_after=censored_df["time"]
# or
cph.predict_median(censored_df)       # optionally: conditional_after=censored_df["time"]
For scikit-survival, the prediction is obtained from predict_survival_function().
Concordance index = 0.82
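For reference, here is a minimal sketch of how I take the median from scikit-survival's step functions (I use CoxPHSurvivalAnalysis; the 0.5 threshold and the np.inf fallback for curves that never cross it are my choices):

import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.util import Surv

X = data[["var1", "var2"]]
y = Surv.from_arrays(event=data["churn"].astype(bool), time=data["time"])

est = CoxPHSurvivalAnalysis().fit(X, y)
surv_fns = est.predict_survival_function(X)

# median = first time point where the step function drops to 0.5 or below;
# if the curve never crosses 0.5 there is no finite median
medians = np.array([
    fn.x[fn.y <= 0.5][0] if (fn.y <= 0.5).any() else np.inf
    for fn in surv_fns
])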
Part 2: Compare the results with the actuals
So now I have created a table using both methods: predict_expectation() (the "expected" column) and predict_median() (the "median" column). For scikit-survival it can only be calculated by taking the median (please note that I am aware this might differ for other algorithms in lifelines/scikit-survival, but focus on the idea). The table looks like this:
ids  churn  time   expected  diff_expectation  median  diff_median
0        0  21.0  21.526222          0.526222     8.0        -13.0
1        0   4.0  21.819911         17.819911    13.0          9.0
3        0  16.0  23.189344          7.189344     9.0         -7.0
4        0   5.0  22.090598         17.090598    12.0          7.0
6        0   8.0  21.545022         13.545022    10.0          2.0
...    ...   ...        ...               ...     ...          ...
The "diff" columns hold the difference between the respective predicted column and "time".
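For completeness, the table above was assembled roughly like this (a sketch, reusing cph and censored_df from Part 1; .squeeze() just flattens the prediction to a single column):

comparison = censored_df[["churn", "time"]].copy()
comparison["expected"] = cph.predict_expectation(censored_df).squeeze()
comparison["diff_expectation"] = comparison["expected"] - comparison["time"]
comparison["median"] = cph.predict_median(censored_df).squeeze()
comparison["diff_median"] = comparison["median"] - comparison["time"]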
Questions
1. Why are the expected times so off?
2. Is there anything wrong with my approach? Should I predict on the whole data (censored + uncensored) or just on the censored rows? I have tried all three permutations (only censored, only uncensored, both) and the estimates are still off. My understanding is that if the survival curve for an ID converges to 0 (uncensored data) you can calculate the expectation as the area under the curve, whereas if it is censored you need to use the median of the survival curve; the calculation above was done with that in mind.
3. How can I achieve a closer estimate? If I fit the model only on uncensored data and then predict on that same uncensored data, I should get very close estimates, right? Well, this is not the case. You can check this by comparing the average of the predictions to the median of the actual values, or by taking the mean of the "diff" column to see if it at least averages to 0; neither holds, which suggests some bias in the model (a sketch of this check follows the questions).
4. Why does predict_expectation output something different from predict_median? Which one is more recommended to use?
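Here is the sanity check from question 3 as a sketch (fit and predict on the uncensored rows only; variable names are mine):

from lifelines import CoxPHFitter

uncensored = data[data["churn"] == 1]

cph_u = CoxPHFitter()
cph_u.fit(uncensored, duration_col="time", event_col="churn")

pred = cph_u.predict_expectation(uncensored).squeeze()
diff = pred - uncensored["time"]
print(diff.mean())                               # I would expect ~0
print(pred.mean(), uncensored["time"].median())  # I would expect these to be close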
This phenomenon happens with any dataset; you can try replicating it with the load_leukemia dataset from lifelines.datasets. Even if you get a 0.9 concordance index, it still happens.
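Something along these lines should reproduce it (I believe the duration and event columns in that dataset are called t and status, but please double-check):

from lifelines import CoxPHFitter
from lifelines.datasets import load_leukemia

leuk = load_leukemia()  # columns include t (duration) and status (event)
cph = CoxPHFitter()
cph.fit(leuk, duration_col="t", event_col="status")
print(cph.concordance_index_)                                        # high concordance...
print((cph.predict_expectation(leuk).squeeze() - leuk["t"]).mean())  # ...yet still biased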
Here are a few resources I found that sort of explain this, but I don't fully understand them; if someone can break it down a bit more, that would be great.
Sources
- https://github.com/sebp/scikit-survival/issues/94
- https://github.com/sebp/scikit-survival/issues/190
- https://scikit-survival.readthedocs.io/en/latest/user_guide/understanding_predictions.html
- https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html#lifelines.fitters.coxph_fitter.CoxPHFitter.predict_expectation
You can find a fully coded example here: https://github.com/felipe0216/survival_examples/blob/main/predict_expectation_scikit.py
This article gives a nice explanation of the differences between the expectation and the median as a way to predict survival time. Basically, the expectation is a good prediction only if the data you're dealing with eventually reaches a survival probability of S(t) = 0, because if it doesn't, the expectation (calculated as the area under the survival curve) will be infinite. In that case the median (the time at which the survival probability crosses 0.5) is more appropriate. However, sometimes the data never reaches S(t) = 0.5 either. So I think the answer is that it depends.
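To make the difference concrete, here is a sketch of how both quantities fall out of the predicted survival curves (reusing cph and censored_df from Part 1; the hand-rolled integration and 0.5 crossing are illustrative, not lifelines internals):

import numpy as np

# predicted survival curves: index = timeline, one column per subject
surv = cph.predict_survival_function(censored_df)

# expectation ~ area under S(t), via the trapezoidal rule over the observed
# timeline; this understates the true integral if S(t) has not reached 0
# by the last time point
expectation_by_hand = np.trapz(surv.values, x=surv.index.values, axis=0)

# median = first time at which S(t) <= 0.5 (infinite if never crossed)
median_by_hand = np.array([
    surv.index[surv[c] <= 0.5][0] if (surv[c] <= 0.5).any() else np.inf
    for c in surv.columns
])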