I am working with a clinical data table that contains hospital episodes for a few hundred thousand people across an approx 15 year period. The table is formatted as one row per episode, each of which contains the episode number, date of the episode (date & episodes do not match chronologically for some reason) and binary indicators for 56 diagnoses of interest that may have occurred during each episode.
Short reproducible sample below...
df <- data.frame(
id = c("1001", "1001", "1001", "1001", "1001", "1001",
"1002", "1002", "1002", "1002", "1002",
"1003", "1003", "1003", "1003", "1003", "1003", "1003",
"1005", "1005", "1005", "1005", "1005", "1005", "1005", "1005", "1005"),
episode = c(0, 1, 2, 3, 4, 5,
0, 1, 3, 4, 5,
0, 2, 3, 6, 7, 9, 10,
1, 2, 3, 4, 5, 6, 7, 8, 9),
date = sample(seq(as.Date('2010/01/01'), as.Date('2016/01/01'), by="day"), 27),
diag_a = c(1, 1, 0, 0, 1, 1,
0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0),
diag_b = c(0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0),
diag_c = c(1, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0),
diag_d = c(1, 1, 0, 0, 1, 1,
0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0)
)
I am hoping to achieve the following, but struggling:
- Identify if any of the 56 diagnoses appear across any of the episodes and extracts the earliest date of diagnosis of the condition.
- Create a new binary indicator that indicates whether the participant has two or more of the 56 diagnoses across their series of hospital episodes (i.e. a 'multimorbidity' 0/1 column) and generate a new date column for when 'multimorbidity' is first achieved (i.e. date of second diagnosis of interest [which could happen within a single episode]). The dataset contained data about thousands of diagnoses, but I am only interested in these 56 contributing to multimorbidity.
- Count and sums into a new column (e.g. 'condition_count') how many of the 56 diagnoses appear across the episodes for each ID.
I'm getting lost on where to start with this, probably because I am trying to achieve too many things in one go. Ultimately, the most important part is to extract if each ID had multimorbidity (2+ diagnoses) and the date that this occurred, for use in subsequent regressions.
If I understand what you want, here is how you could generate each of the results:
For the first one, I would pivot the data to longer on all the relevant diagnosis variables, identify only those that were actually diagnosed, group by the condition and then use
slice_min()to take the smallest date for eachid-conditionpair.For the second one, start with the previous dataset, then arrange by
idanddateand calculate the cumulative sum of diagnoses. In particular, yo want the date where the second diagnosis happened, so you can filter to where the cumulative sum of diagnoses is 2 and then merge those data back on to the original one.Finally, for the third one, start with the second dataset then picot the diagnosis columns to longer again. If you group by
idandcondition, you can find the maximum ofdiag- whether they ever had the condition or not. Then, you can group byidand sum the number of conditions and merge it back on to the original data.Created on 2024-03-07 with reprex v2.0.2