R Creating a time-varying survival dataset from event data


I want to create a survival dataset with multiple records per id. The existing event data has one row per observation, with the date formatted as dd/mm/yy. The idea is to count the number of consecutive months with at least one event per month (the data span multiple years, so this has to be accounted for somehow). In other words, I want to create episodes that capture such monthly streaks, including periods of inactivity. For example, the code should transform something like this:

df1
id        event.date
group1    01/01/16
group1    05/02/16
group1    07/03/16
group1    10/06/16
group1    12/09/16

to this:

df2
id        t0    t1    ep.no   ep.t   ep.type
group1    1     3     1       3      1  
group1    4     5     2       2      0
group1    6     6     3       1      1
group1    7     8     4       2      0
group1    9     9     5       1      1
group1    10    ...   ...     ...    ...

where t0 and t1 are the start and end months, ep.no is the episode counter for the particular id, ep.t is the length of that episode in months, and ep.type indicates the type of episode (1 = active, 0 = inactive). In the example above, there is an initial three months of activity, then a two-month break, followed by a single-month relapse, and so on.

I am mostly concerned with the transformation that produces t0 and t1 in df2 from df1, as the other variables in df2 can be constructed afterwards from them (e.g. ep.no is a counter, ep.t is arithmetic, and ep.type always starts at 1 and alternates). Given the complexity of the problem (at least for me), I understand the need to provide the actual data, but I am not sure whether that is allowed. I will see what I can do if a mod chimes in.
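To make the "constructed afterwards" part concrete, here is a minimal base R sketch that derives ep.no, ep.t and ep.type once t0 and t1 exist. The data frame name `eps` is hypothetical, and the values are copied from the df2 example above; it assumes rows are already ordered by episode within each id.

```r
# Hypothetical t0/t1 columns taken from the df2 example in the question
eps <- data.frame(
  id = "group1",
  t0 = c(1, 4, 6, 7, 9),
  t1 = c(3, 5, 6, 8, 9)
)

# Episode counter within each id
eps$ep.no <- ave(eps$t0, eps$id, FUN = seq_along)
# Episode length in months, counting both the start and the end month
eps$ep.t <- eps$t1 - eps$t0 + 1
# Type alternates 1, 0, 1, ... because the first episode is always active
eps$ep.type <- eps$ep.no %% 2

eps
```

The alternation via `%% 2` only holds under the stated assumption that every id starts with an active episode.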


1 Answer

Answered by atiretoo

I think this does what you want. The trick is identifying the runs of consecutive months that need to be treated together; combining dplyr::lag with cumsum is the way to go.
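As a stripped-down illustration of the lag-plus-cumsum idea on its own, here is a small sketch on a plain vector. The `event` flags are made up for illustration (they mirror months 1 to 9 of the question's example) and are not part of the real data.

```r
library(dplyr)

# Toy monthly activity flags (1 = at least one event that month); made up for illustration
event <- c(1, 1, 1, 0, 0, 1, 0, 0, 1)

# TRUE wherever the flag differs from the previous month;
# default = -1 forces the first month to start an episode
is_start <- dplyr::lag(event, default = -1) != event

# Running count of starts labels each run with its episode number
episode <- cumsum(is_start)
episode
```

Every month in the same run ends up with the same episode number, so a later group_by() on that number can summarise each streak.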

# Convert to date objects, summarize by month, insert missing months
library(tidyverse)
library(lubridate)

# added rows of data to demonstrate that it works with
# > 1 id and > 1 event per month and rolls across year end
df1 <- read_table("id        event.date
group1    01/01/16
group1    02/01/16
group1    05/02/16
group1    07/03/16
group1    10/06/16
group1    12/09/16
group1    01/02/17
group2    01/01/16
group2    05/02/16
group2    07/03/16",col_types="cc")

# dmy() parses the dd/mm/yy strings into dates;
# extract year and month so events can be counted per month
df1.1 <- mutate(df1, event.date=dmy(event.date),
              yr=year(event.date),
              mon=month(event.date))

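# get down to one row per id and month, then fill in months without events
# (ungroup first: complete() cannot fill in a grouping column)
df2 <- group_by(df1.1,id,yr,mon) %>%
  summarize(events=n()) %>%
  ungroup() %>%
  complete(nesting(id, yr), mon=1:12, fill=list(events=0L)) %>%
  arrange(id, yr, mon) %>%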
  group_by(id) %>%
  mutate(event = as.numeric(events >0),
    is_start=lag(event,default=-1)!=event,
    episode=cumsum(is_start), 
    episode.date=ymd(paste(yr,mon,1,sep="-"))) %>%
  group_by(id, episode) %>%
  summarize(t0 = first(episode.date),
            t1 = last(episode.date) %m+% months(1),
            ep.length = as.numeric((last(episode.date) %m+% months(1)) - first(episode.date)),
            ep.type = first(event)) 

Gives

Source: local data frame [10 x 6]
Groups: id [?]

       id episode         t0         t1 ep.length ep.type
    <chr>   <int>     <dttm>     <dttm>     <dbl>   <dbl>
1  group1       1 2016-01-01 2016-04-01        91       1
2  group1       2 2016-04-01 2016-06-01        61       0
3  group1       3 2016-06-01 2016-07-01        30       1
4  group1       4 2016-07-01 2016-09-01        62       0
5  group1       5 2016-09-01 2016-10-01        30       1
6  group1       6 2016-10-01 2017-02-01       123       0
7  group1       7 2017-02-01 2017-03-01        28       1
8  group1       8 2017-03-01 2018-01-01       306       0
9  group2       1 2016-01-01 2016-04-01        91       1
10 group2       2 2016-04-01 2017-01-01       275       0

Using complete() with mon=1:12 will always make the last episode stretch to the end of that year. The solution would be to insert a filter() on yr and mon after complete() to drop the unwanted trailing months.
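A minimal sketch of such a filter, assuming the cutoff you want is the last month with an event for each id (you may prefer a fixed study end date instead). The `monthly` tibble and the `month_idx` helper column are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy monthly counts for one id; made up for illustration
monthly <- tibble(
  id = "group1",
  yr = 2016,
  mon = c(1L, 2L, 3L, 6L, 9L),
  events = c(2L, 1L, 1L, 1L, 1L)
)

trimmed <- monthly %>%
  # pad every observed id/year combination out to 12 months
  complete(nesting(id, yr), mon = 1:12, fill = list(events = 0L)) %>%
  group_by(id) %>%
  # running month index so the comparison also works across years
  mutate(month_idx = yr * 12 + mon) %>%
  # drop the padded months after the last month that has an event
  filter(month_idx <= max(month_idx[events > 0])) %>%
  ungroup()

trimmed
```

Here October to December are removed, so the final episode ends in September rather than running to the end of the year.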

The advantage of keeping t0 and t1 as date-time objects is that they work correctly across year boundaries, which plain month numbers (1 to 12) do not.
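If you still want t0 and t1 as running month counters like in your df2, one option is to convert the dates to a month index that keeps counting across years. This is only a sketch: the `eps` tibble holds a few rows shaped like the output above, and `month_idx`, `origin`, `t0.mon` and `t1.mon` are names I made up. It assumes t1 is the first day of the month after the episode ends, as in the code above.

```r
library(dplyr)
library(lubridate)

# A few episode rows shaped like the output above (t1 is exclusive)
eps <- tibble(
  id = "group1",
  t0 = ymd(c("2016-01-01", "2016-04-01", "2016-10-01", "2017-02-01")),
  t1 = ymd(c("2016-04-01", "2016-06-01", "2017-02-01", "2017-03-01"))
)

# Running month count, so the index does not reset in January
month_idx <- function(d) year(d) * 12 + month(d)

eps_months <- eps %>%
  group_by(id) %>%
  mutate(
    origin = min(month_idx(t0)),          # first month seen for this id
    t0.mon = month_idx(t0) - origin + 1,  # first month of the episode
    t1.mon = month_idx(t1) - origin,      # last month of the episode (t1 is exclusive)
    ep.t = t1.mon - t0.mon + 1            # episode length in months
  ) %>%
  ungroup()

eps_months
```

The episode starting in October 2016 gets t0.mon = 10 and t1.mon = 13, so it correctly spans four months across the year end.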

Session information:

R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets 
[6] methods   base     

other attached packages:
[1] lubridate_1.3.3 dplyr_0.5.0     purrr_0.2.2    
[4] readr_0.2.2     tidyr_0.6.0     tibble_1.2     
[7] ggplot2_2.2.0   tidyverse_1.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8      knitr_1.15.1     magrittr_1.5    
 [4] munsell_0.4.2    colorspace_1.2-6 R6_2.1.3        
 [7] stringr_1.1.0    highr_0.6        plyr_1.8.4      
[10] tools_3.3.2      grid_3.3.2       gtable_0.2.0    
[13] DBI_0.5          lazyeval_0.2.0   assertthat_0.1  
[16] digest_0.6.10    memoise_1.0.0    evaluate_0.10   
[19] stringi_1.1.2    scales_0.4.1