I am currently working on obtaining survival probabilities from Cox proportional hazard models. Initially, I utilized the coxph function, but for large data, it proved to be slow when predicting at all time points for all different individuals in the data. I tried using the cph function from the rms package which significantly improved the computational speed. However, I noticed slight differences in the predicted probabilities between the two methods. Specifically, at smaller time points, the differences are up to about 5 decimal places, while at higher time points, they can reach up to 2 decimal places. I set parameters to be the same and even obtained similar coefficient estimates.
Here are the codes from both functions:
From the coxph function:
S_t <- coxph(Surv(time = Y, event = d) ~ ns(age)+ns(psa)+tstage+deyo+gs+insurance+income+education+race, ties="breslow", data = data)
S_t_pred.obj <- survfit(S_t, stype=1, newdata=X, se.fit = FALSE)
S_t_pred.sum <- summary(S_t_pred.obj, times=Y, extend = TRUE)$surv
From the cph function:
S_t1 <- cph(Surv(time = Y, event = d) ~ ns(age)+ns(psa)+tstage+deyo+gs+insurance+income+education+race, method="breslow", data = data, surv=T)
S_t_pred.sum1 <- survest(S_t1, stype=1, times=Y, newdata=X, se.fit = FALSE, extend = TRUE)$surv
I would greatly appreciate any insights or guidance on why these differences may be occurring. Thank you for your time.
Looking at the source code for each function (https://github.com/harrelfe/rms/blob/master/R/cph.s / https://github.com/cran/survival/blob/master/R/coxph.R), there are some options you can adjust to get the same output, e.g.
Created on 2023-11-17 with reprex v2.0.2
Does that work with your actual data?