Data Formatting for Time Varying Covariate Cox Proportional Hazards Modeling in R

Question

Data Formatting for Time Varying Covariate Cox Proportional Hazards Modeling in R

2.8k views Asked by Entropy At 13 November 2014 at 02:16

I am attempting to develop a time varying Cox proportional hazards (CPH) model in R and was wondering if anyone has generated any code to help format data for the counting structure that is used in time varying / time dependent CPH models.

To make the problem reproducible and somewhat simpler, I have extracted the first 100 rows of data, which features 4 variables (id, date, y, and x). The id is a unique subject identifier. The date is an integer sequence from 0 to n days of observation for each id. y is the status or outcome of the hazard analysis and x is the time varying covariate. In this example, once y = 1 has occurred the data for each subject will be censored and no additional data should be included in the ideal output dataframe.

The data are structured so that each subject has 1 row that corresponds to each day of observation.

head(test)
id date y x
1     0 0 0
1     1 0 1
1     2 0 1
1     3 0 1
1     4 0 1
1     5 0 0

However, as I understand it, the cph function in R requires that time varying covariates be structured in such a way that the start and end variables need to be recoded into 3 rows with intervals from (0,1] and (1,5] and (5,6] for the data featured in the head(test) code block above.

The first 100 rows of data can be reconstructed using this code:

dput(test)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 
5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 
9, 9, 9), date = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 
3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 
8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
0, 1, 2, 3, 4, 5, 6, 7, 8), y = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 1, 0, 0, 0), x = c(0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 
1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L)), .Names = c("id", 
"date", "y", "x"), row.names = c(NA, -100L), class = "data.frame")

Ideally, I am trying to recode these data so that the output would be:

head(ideal_output)
id start end y x
1      0   1 0 0
1      1   5 0 1
1      5   6 0 0
1      6   7 0 1
1      7   9 0 0
1      9  11 0 1
1     11  20 0 0
2      0   8 0 0
3      0   1 0 0
3      1   3 0 1
3      3   4 0 0
3      4   6 0 1
3      6   7 1 1
4      0   2 0 0
4      2   4 0 1
4      4   7 0 0
5      0   9 0 0
6      0   7 0 0
7      0   1 0 0
7      1   2 0 1
7      2   3 0 0 
7      3   4 1 0
8      0   3 0 0
8      3   4 1 1
9      0   2 0 0
9      2   5 0 1
9      5   6 1 1

I have done this manually to create the ideal_output above but it is an error prone process and untenable for the hundreds of id's and several covariates that I need to evaluate. Consequently, any help would be greatly appreciated in developing an automated way to approach this data formatting challenge. Thanks!

Original Q&A

There are 3 answers

**Linda Ejlskov** · Answer 1 · 2014-12-04T11:02:55+00:00

Linda Ejlskov On 04 December 2014 at 11:02

I think the Survsplit() function is the answer to your problem.

look at: http://www.rdocumentation.org/packages/eha/functions/SurvSplit

Alternatively, try to google: Chapter 5 Extended and Stratified Cox - nus.edu.sg

**Ham** · Answer 2 · 2016-06-20T13:47:53+00:00

Ham On 20 June 2016 at 13:47

I think the tmerge() function is the answer to your problem.

look at: https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf

**Benjamin Christoffersen** · Answer 3 · 2017-10-21T18:55:31+00:00

As @Ham suggest you can use tmerge. Here is an example

> #####
> # `dat` is the data.frame you provided 
> library(survival)
> 
> # make baseline data.frame for tmerge
> baseline <- by(dat, dat$id, function(x){
+     n <- nrow(x)
+     # avoid slow data.frame call
+     structure(list(
+       id = x$id[1], start = x$date[1], x = x$x[1], end = x$date[n], 
+       dummy = 0),
+       row.names = 1L, class = "data.frame")
+   })
> baseline <- do.call(rbind, baseline)
> baseline # show baseline data
  id start x end dummy
1  1     0 0  19     0
2  2     0 0   7     0
3  3     0 0  12     0
4  4     0 0   6     0
5  5     0 0   8     0
6  6     0 0   6     0
7  7     0 0  11     0
8  8     0 0  14     0
9  9     0 0   8     0
> 
> # use tmerge
> final_dat <- tmerge(baseline, baseline, id = id, y = event(end, dummy))
> final_dat <- tmerge(
+   final_dat, dat, id = id, y = cumtdc(date, y), x = tdc(date, x))
> final_dat[final_dat$id == 3, ] # look at one example
   id start x end dummy tstart tstop y
27  3     0 0  12     0      0     1 0
28  3     0 1  12     0      1     2 0
29  3     0 1  12     0      2     3 0
30  3     0 0  12     0      3     4 0
31  3     0 1  12     0      4     5 0
32  3     0 1  12     0      5     6 0
33  3     0 1  12     0      6     7 1
34  3     0 1  12     0      7     8 1
35  3     0 1  12     0      8     9 1
36  3     0 1  12     0      9    10 1
37  3     0 1  12     0     10    11 1
38  3     0 0  12     0     11    12 1
> 
> # remove values where y is not zero or y is not the first non-zero value
> final_dat <- within(final_dat, ycum <- unlist(tapply(y, id, cumsum)))
> final_dat <- final_dat[final_dat$ycum < 2, ]
> final_dat$ycum <- NULL
> final_dat[final_dat$id == 3, ]
   id start x end dummy tstart tstop y
27  3     0 0  12     0      0     1 0
28  3     0 1  12     0      1     2 0
29  3     0 1  12     0      2     3 0
30  3     0 0  12     0      3     4 0
31  3     0 1  12     0      4     5 0
32  3     0 1  12     0      5     6 0
33  3     0 1  12     0      6     7 1
> 
> # remove x row where the previous x value do match. But
> #  * keep those where y = 1
> #  * update tstop for the last row where the last row may be removed
> final_dat <- within(
+   final_dat,
+   max_t <- unlist(tapply(tstop, id, function(z) rep(max(z), length(z))))) 
> final_dat <- within(
+   final_dat, 
+   keep <- unlist(tapply(x, id, function(z)
+     c(TRUE, z[-1] != z[-length(z)]))))
> 
> final_dat <- final_dat[final_dat$keep | final_dat$y, ]
> 
> final_dat <- within(
+   final_dat, is_last <- unlist(tapply(id, id, function(z) 
+     seq_along(z) == length(z))))
> 
> needs_update <- final_dat$is_last & !final_dat$y
> final_dat[needs_update, "tstop"] <- 
+   final_dat[needs_update, "max_t"]  + 1
> 
> # have to update the tstop column 
> final_dat <- within(final_dat, tstop <- unlist(by(
+   cbind(tstart, tstop), id, function(z) {
+     n <- nrow(z)
+     c(z$tstart[-1], z$tstop[n])
+ })))
> 
> # show final data.frame
> final_dat[, c("id", "tstart", "tstop", "y", "x")]
   id tstart tstop y x
1   1      0     1 0 0
2   1      1     5 0 1
6   1      5     6 0 0
7   1      6     7 0 1
8   1      7     9 0 0
10  1      9    11 0 1
12  1     11    20 0 0
20  2      0     8 0 0
27  3      0     1 0 0
28  3      1     3 0 1
30  3      3     4 0 0
31  3      4     6 0 1
33  3      6     7 1 1
39  4      0     2 0 0
41  4      2     4 0 1
43  4      4     7 0 0
45  5      0     9 0 0
53  6      0     7 0 0
59  7      0     1 0 0
60  7      1     2 0 1
61  7      2     3 0 0
62  7      3     4 1 0
70  8      0     3 0 0
73  8      3     4 1 1
84  9      0     2 0 0
86  9      2     5 0 1
89  9      5     6 1 1

The code after tmerge can be done faster with dplyr or data.table. If you have more columns than just one, x, then I suggest that you: 1) store a column index of dat and use that in tmerge in the tdc function instead of x. Then merge the tables afterwards with merge. Further, you need to update the line that makes the keep indicator. Otherwise the code should be identical.

TechQA.

Data Formatting for Time Varying Covariate Cox Proportional Hazards Modeling in R

There are 3 answers

Related Questions in R

Related Questions in TIME

Related Questions in SURVIVAL-ANALYSIS

Related Questions in COX-REGRESSION

Popular Questions

Popular Tags

Trending Questions