Let me start by providing my sample dataset:
ID Start Code End Days
1 2016-03-01 A 2016-03-14 14
1 2016-03-01 A 2016-03-14 14
1 2016-03-01 B 2016-04-01 30
2 2016-02-01 A 2016-03-01 28
I'd like to, for each ID, and within this group, for each Code, check if the End is larger dan Start in the next row (if df$End[i] > df$Start[i+1]) and if so, update Start of the next row to End en recompute End (which is Start + Days) for that row i+1. The results should thus be:
ID Start Code End Days
1 2016-03-01 A 2016-03-14 14
1 2016-03-14 A 2016-03-28 14
1 2016-03-01 B 2016-04-01 30
2 2016-02-01 A 2016-03-01 28
Afterwards, if for a ID, and a Code the difference between df$End[i] - df$Start[i+1] <= 7, I would like to combine the rows, using the smallest df$Start and the largest df$End for this subset. Making:
ID Start Code End Days
1 2016-03-01 A 2016-03-28 14
1 2016-03-01 B 2016-04-01 30
2 2016-02-01 A 2016-03-01 28
Since my dataset is over 100M rows, I'd like a fast solution. Unfortunately I am pretty new to dplyr, so help is highly appreciated!
update: larger example:
ID Start Code End Days
1 2012-04-01 A 2012-04-07 7
1 2016-03-01 B 2016-03-15 15
1 2016-03-01 B 2016-05-29 90
1 2016-06-01 B 2016-08-29 90
1 2016-09-01 B 2016-11-29 90
1 2016-12-01 B 2017-02-28 90
1 2017-03-01 B 2017-05-09 90
1 2017-08-01 B 2017-10-29 90
1 2017-12-01 B 2018-02-28 90
2 2016-04-01 B 2016-04-14 14
This results in:
ID Start Code End
1 2012-04-01 A 2012-04-07
1 2016-03-01 B 2017-02-28
1 2017-03-01 B 2017-05-29
1 2018-08-01 B 2017-12-05
2 2016-04-01 B 2016-04-14
Where I would expect row 2 and to be combined.
For the first step I tried:
grouped_df <-
df %>%
group_by(ID, Code) %>%
mutate_at(vars(Start, End), funs(as.Date)) %>%
mutate(new_start = as.Date(ifelse(lag(End > Start), lag(End), Start), origin="1970-01-01")) %>%
mutate(new_stop= new_disp + Days)
However, if a new_end has been computed, we should now compare new_end and not End with new_start (and not Start).