Creating a binary variable with an attached date from a series of other dated episodes (long format)

24 views Asked by At

I am working with a clinical data table that contains hospital episodes for a few hundred thousand people across an approx 15 year period. The table is formatted as one row per episode, each of which contains the episode number, date of the episode (date & episodes do not match chronologically for some reason) and binary indicators for 56 diagnoses of interest that may have occurred during each episode.

Short reproducible sample below...

df <- data.frame(
  id = c("1001", "1001", "1001", "1001", "1001", "1001", 
         "1002", "1002", "1002", "1002", "1002",
         "1003", "1003", "1003", "1003", "1003", "1003", "1003",
         "1005", "1005", "1005", "1005", "1005", "1005", "1005", "1005", "1005"),
  episode = c(0, 1, 2, 3, 4, 5,
              0, 1, 3, 4, 5,
              0, 2, 3, 6, 7, 9, 10,
              1, 2, 3, 4, 5, 6, 7, 8, 9),
  date = sample(seq(as.Date('2010/01/01'), as.Date('2016/01/01'), by="day"), 27),
    diag_a = c(1, 1, 0, 0, 1, 1,
                    0, 0, 0, 0, 0,
                    0, 0, 1, 1, 1, 1, 1,
                    0, 0, 0, 0, 0, 0, 0, 0, 0),
  diag_b = c(0, 0, 0, 0, 0, 0,
                    1, 1, 1, 1, 1,
                    0, 0, 0, 0, 0, 0, 0,
                    0, 0, 1, 0, 0, 0, 0, 0, 0),
  diag_c = c(1, 0, 0, 1, 0, 0,
                    0, 0, 0, 0, 0,
                    0, 0, 1, 1, 1, 1, 1,
                    0, 0, 0, 0, 0, 0, 0, 0, 0),
  diag_d = c(1, 1, 0, 0, 1, 1,
                    0, 0, 0, 0, 0,
                    0, 0, 1, 1, 1, 1, 1,
                    0, 0, 0, 0, 1, 0, 0, 0, 0)
)

I am hoping to achieve the following, but struggling:

  1. Identify if any of the 56 diagnoses appear across any of the episodes and extracts the earliest date of diagnosis of the condition.
  2. Create a new binary indicator that indicates whether the participant has two or more of the 56 diagnoses across their series of hospital episodes (i.e. a 'multimorbidity' 0/1 column) and generate a new date column for when 'multimorbidity' is first achieved (i.e. date of second diagnosis of interest [which could happen within a single episode]). The dataset contained data about thousands of diagnoses, but I am only interested in these 56 contributing to multimorbidity.
  3. Count and sums into a new column (e.g. 'condition_count') how many of the 56 diagnoses appear across the episodes for each ID.

I'm getting lost on where to start with this, probably because I am trying to achieve too many things in one go. Ultimately, the most important part is to extract if each ID had multimorbidity (2+ diagnoses) and the date that this occurred, for use in subsequent regressions.

1

There are 1 answers

2
DaveArmstrong On

If I understand what you want, here is how you could generate each of the results:

library(dplyr)
library(tidyr)
set.seed(1234)
df <- data.frame(
  id = c("1001", "1001", "1001", "1001", "1001", "1001", 
         "1002", "1002", "1002", "1002", "1002",
         "1003", "1003", "1003", "1003", "1003", "1003", "1003",
         "1005", "1005", "1005", "1005", "1005", "1005", "1005", "1005", "1005"),
  episode = c(0, 1, 2, 3, 4, 5,
              0, 1, 3, 4, 5,
              0, 2, 3, 6, 7, 9, 10,
              1, 2, 3, 4, 5, 6, 7, 8, 9),
  date = sample(seq(as.Date('2010/01/01'), as.Date('2016/01/01'), by="day"), 27),
  diag_a = c(1, 1, 0, 0, 1, 1,
             0, 0, 0, 0, 0,
             0, 0, 1, 1, 1, 1, 1,
             0, 0, 0, 0, 0, 0, 0, 0, 0),
  diag_b = c(0, 0, 0, 0, 0, 0,
             1, 1, 1, 1, 1,
             0, 0, 0, 0, 0, 0, 0,
             0, 0, 1, 0, 0, 0, 0, 0, 0),
  diag_c = c(1, 0, 0, 1, 0, 0,
             0, 0, 0, 0, 0,
             0, 0, 1, 1, 1, 1, 1,
             0, 0, 0, 0, 0, 0, 0, 0, 0),
  diag_d = c(1, 1, 0, 0, 1, 1,
             0, 0, 0, 0, 0,
             0, 0, 1, 1, 1, 1, 1,
             0, 0, 0, 0, 1, 0, 0, 0, 0)
)

For the first one, I would pivot the data to longer on all the relevant diagnosis variables, identify only those that were actually diagnosed, group by the condition and then use slice_min() to take the smallest date for each id-condition pair.

d1 <- df %>% 
  pivot_longer(contains("diag_"), names_to = "condition", values_to = "diag") %>% 
  filter(diag == 1) %>% 
  group_by(id, condition) %>% 
  slice_min(date, n=1) 
d1
#> # A tibble: 9 × 5
#> # Groups:   id, condition [9]
#>   id    episode date       condition  diag
#>   <chr>   <dbl> <date>     <chr>     <dbl>
#> 1 1001        1 2011-09-15 diag_a        1
#> 2 1001        3 2011-02-04 diag_c        1
#> 3 1001        1 2011-09-15 diag_d        1
#> 4 1002        1 2010-09-27 diag_b        1
#> 5 1003        3 2011-04-24 diag_a        1
#> 6 1003        3 2011-04-24 diag_c        1
#> 7 1003        3 2011-04-24 diag_d        1
#> 8 1005        3 2012-10-10 diag_b        1
#> 9 1005        5 2015-05-06 diag_d        1

For the second one, start with the previous dataset, then arrange by id and date and calculate the cumulative sum of diagnoses. In particular, yo want the date where the second diagnosis happened, so you can filter to where the cumulative sum of diagnoses is 2 and then merge those data back on to the original one.

d2 <- df %>% 
  pivot_longer(contains("diag_"), names_to = "condition", values_to = "diag") %>% 
  filter(diag == 1) %>% 
  group_by(id, condition) %>% 
  slice_min(date, n=1) %>% 
  ungroup %>% 
  arrange(id, date) %>% 
  group_by(id) %>% 
  mutate(cs_diag = cumsum(diag)) %>% 
  filter(cs_diag == 2) %>% 
  select(id, date) %>% 
  rename(mm_date = date) %>% 
  right_join(df) 
#> Joining with `by = join_by(id)`
d2
#> # A tibble: 27 × 8
#> # Groups:   id [4]
#>    id    mm_date    episode date       diag_a diag_b diag_c diag_d
#>    <chr> <date>       <dbl> <date>      <dbl>  <dbl>  <dbl>  <dbl>
#>  1 1001  2011-09-15       0 2012-09-30      1      0      1      1
#>  2 1001  2011-09-15       1 2011-09-15      1      0      0      1
#>  3 1001  2011-09-15       2 2012-07-22      0      0      0      0
#>  4 1001  2011-09-15       3 2011-02-04      0      0      1      0
#>  5 1001  2011-09-15       4 2015-11-16      1      0      0      1
#>  6 1001  2011-09-15       5 2014-06-14      1      0      0      1
#>  7 1003  2011-04-24       0 2013-05-03      0      0      0      0
#>  8 1003  2011-04-24       2 2012-09-22      0      0      0      0
#>  9 1003  2011-04-24       3 2011-04-24      1      0      1      1
#> 10 1003  2011-04-24       6 2011-09-26      1      0      1      1
#> # ℹ 17 more rows

Finally, for the third one, start with the second dataset then picot the diagnosis columns to longer again. If you group by id and condition, you can find the maximum of diag - whether they ever had the condition or not. Then, you can group by id and sum the number of conditions and merge it back on to the original data.

d3 <- d2 %>% 
  select(id, contains("diag")) %>%
  pivot_longer(-id, names_to = "condition", values_to = "diag") %>% 
  group_by(id, condition) %>% 
  summarise(diag = max(diag)) %>% 
  group_by(id) %>% 
  summarise(condition_count = sum(diag)) %>% 
  right_join(d2)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> Joining with `by = join_by(id)`
d3
#> # A tibble: 27 × 9
#>    id    condition_count mm_date    episode date       diag_a diag_b diag_c
#>    <chr>           <dbl> <date>       <dbl> <date>      <dbl>  <dbl>  <dbl>
#>  1 1001                3 2011-09-15       0 2012-09-30      1      0      1
#>  2 1001                3 2011-09-15       1 2011-09-15      1      0      0
#>  3 1001                3 2011-09-15       2 2012-07-22      0      0      0
#>  4 1001                3 2011-09-15       3 2011-02-04      0      0      1
#>  5 1001                3 2011-09-15       4 2015-11-16      1      0      0
#>  6 1001                3 2011-09-15       5 2014-06-14      1      0      0
#>  7 1002                1 NA               0 2013-01-07      0      1      0
#>  8 1002                1 NA               1 2010-09-27      0      1      0
#>  9 1002                1 NA               3 2015-08-14      0      1      0
#> 10 1002                1 NA               4 2014-08-12      0      1      0
#> # ℹ 17 more rows
#> # ℹ 1 more variable: diag_d <dbl>

Created on 2024-03-07 with reprex v2.0.2