Linked Questions

Popular Questions

Conditionally update rows and then group

Asked by At

Let me start by providing my sample dataset:

ID Start        Code End          Days
1  2016-03-01   A    2016-03-14   14
1  2016-03-01   A    2016-03-14   14
1  2016-03-01   B    2016-04-01   30
2  2016-02-01   A    2016-03-01   28

I'd like to, for each ID, and within this group, for each Code, check if the End is larger dan Start in the next row (if df$End[i] > df$Start[i+1]) and if so, update Start of the next row to End en recompute End (which is Start + Days) for that row i+1. The results should thus be:

ID Start        Code End          Days
1  2016-03-01   A    2016-03-14   14
1  2016-03-14   A    2016-03-28   14
1  2016-03-01   B    2016-04-01   30
2  2016-02-01   A    2016-03-01   28

Afterwards, if for a ID, and a Code the difference between df$End[i] - df$Start[i+1] <= 7, I would like to combine the rows, using the smallest df$Start and the largest df$End for this subset. Making:

ID Start        Code End          Days
1  2016-03-01   A    2016-03-28   14
1  2016-03-01   B    2016-04-01   30
2  2016-02-01   A    2016-03-01   28

Since my dataset is over 100M rows, I'd like a fast solution. Unfortunately I am pretty new to dplyr, so help is highly appreciated!

update: larger example:

ID  Start       Code  End         Days
1   2012-04-01  A     2012-04-07  7
1   2016-03-01  B     2016-03-15  15
1   2016-03-01  B     2016-05-29  90
1   2016-06-01  B     2016-08-29  90
1   2016-09-01  B     2016-11-29  90
1   2016-12-01  B     2017-02-28  90
1   2017-03-01  B     2017-05-09  90
1   2017-08-01  B     2017-10-29  90
1   2017-12-01  B     2018-02-28  90
2   2016-04-01  B     2016-04-14  14

This results in:

ID   Start       Code   End   
1    2012-04-01  A      2012-04-07
1    2016-03-01  B      2017-02-28
1    2017-03-01  B      2017-05-29
1    2018-08-01  B      2017-12-05
2    2016-04-01  B      2016-04-14

Where I would expect row 2 and to be combined.

For the first step I tried:

 grouped_df <-
 df %>%
 group_by(ID, Code) %>%
 mutate_at(vars(Start, End), funs(as.Date)) %>%
 mutate(new_start = as.Date(ifelse(lag(End > Start), lag(End), Start), origin="1970-01-01")) %>%

mutate(new_stop= new_disp + Days)

However, if a new_end has been computed, we should now compare new_end and not End with new_start (and not Start).

Related Questions