I am trying to fit a time-varying Cox proportional hazards (CPH) model in R and was wondering whether anyone has written code to format data into the counting-process structure used in time-varying / time-dependent CPH models.
To make the problem reproducible and somewhat simpler, I have extracted the first 100 rows of data, which features 4 variables (id, date, y, and x). The id is a unique subject identifier. The date is an integer sequence from 0 to n days of observation for each id. y is the status or outcome of the hazard analysis and x is the time-varying covariate. In this example, once y = 1 has occurred the data for each subject will be censored and no additional data should be included in the ideal output dataframe.
The data are structured so that each subject has 1 row that corresponds to each day of observation.
head(test)
id date y x
1 0 0 0
1 1 0 1
1 2 0 1
1 3 0 1
1 4 0 1
1 5 0 0
However, as I understand it, the cph function in R requires time-varying covariates to be supplied in counting-process form, with a start and an end variable for each interval over which the covariate is constant. For the six rows shown in the head(test) code block above, that means recoding them into 3 rows with intervals (0,1], (1,5], and (5,6].
The first 100 rows of data can be reconstructed using this code:
dput(test)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,
5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9,
9, 9, 9), date = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
0, 1, 2, 3, 4, 5, 6, 7, 8), y = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0), x = c(0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L,
1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L)), .Names = c("id",
"date", "y", "x"), row.names = c(NA, -100L), class = "data.frame")
Ideally, I am trying to recode these data so that the output would be:
head(ideal_output)
id start end y x
1 0 1 0 0
1 1 5 0 1
1 5 6 0 0
1 6 7 0 1
1 7 9 0 0
1 9 11 0 1
1 11 20 0 0
2 0 8 0 0
3 0 1 0 0
3 1 3 0 1
3 3 4 0 0
3 4 6 0 1
3 6 7 1 1
4 0 2 0 0
4 2 4 0 1
4 4 7 0 0
5 0 9 0 0
6 0 7 0 0
7 0 1 0 0
7 1 2 0 1
7 2 3 0 0
7 3 4 1 0
8 0 3 0 0
8 3 4 1 1
9 0 2 0 0
9 2 5 0 1
9 5 6 1 1
I have done this manually to create the ideal_output above, but it is an error-prone process and untenable for the hundreds of ids and several covariates that I need to evaluate. Consequently, any help in developing an automated way to approach this data formatting challenge would be greatly appreciated. Thanks!
I think the SurvSplit() function is the answer to your problem.
Look at: http://www.rdocumentation.org/packages/eha/functions/SurvSplit
Alternatively, try searching for: Chapter 5 Extended and Stratified Cox - nus.edu.sg
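If you would rather do the collapsing by hand, here is a minimal base R sketch of one way it might work. The function name to_counting and the small demo data frame are made up for illustration. It assumes each id has one row per consecutive day with no gaps, that end should be the last day of a run plus 1 (as in your ideal_output), and that rows after the first y = 1 should be dropped. I have only traced it on the toy data below, not on your full data set.

```r
# Hypothetical helper: collapse one-row-per-day data into (start, end] intervals.
# Assumes columns id, date, y, x and gap-free daily dates within each id.
to_counting <- function(df) {
  df <- df[order(df$id, df$date), ]

  # Number of events strictly before each row, per subject;
  # keep only rows up to and including the first event.
  prior_events <- ave(df$y, df$id, FUN = function(v) cumsum(v) - v)
  df <- df[prior_events == 0, ]

  # Start a new run whenever id, x, or y differs from the previous row.
  n <- nrow(df)
  changed <- c(TRUE,
               df$id[-1] != df$id[-n] |
               df$x[-1]  != df$x[-n]  |
               df$y[-1]  != df$y[-n])
  run <- cumsum(changed)

  first <- function(v) v[1]
  data.frame(
    id    = as.vector(tapply(df$id,   run, first)),
    start = as.vector(tapply(df$date, run, min)),
    end   = as.vector(tapply(df$date, run, max)) + 1,
    y     = as.vector(tapply(df$y,    run, max)),
    x     = as.vector(tapply(df$x,    run, first))
  )
}

# Toy data (made up): subject 1 mirrors head(test); subject 2 has an event on day 2.
demo <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  date = c(0, 1, 2, 3, 4, 5, 0, 1, 2, 3),
  y    = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0),
  x    = c(0, 1, 1, 1, 1, 0, 1, 1, 1, 0)
)
result <- to_counting(demo)
print(result)
```

Because a change in y also starts a new run, the event day gets its own interval, which is how your ideal_output treats id 3. With several covariates you would add one more comparison per covariate to the changed vector. The result should then be usable in a formula along the lines of Surv(start, end, y) ~ x, but check that against the documentation of whichever fitting function you use.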