Splitting overlapping rows, within groups, based on dates

290 views Asked by At

I'm trying to create new rows based on overlapping time periods of existing rows. For example, I'd like to turn this:

Customer_Product <- data.table(Customer=c("A01","A01","A01", "A02", "A02", "A02", "A03", "A03", "A03"), 
                Product=c("Prod1","Prod2","Prod3","Prod1","Prod2","Prod3","Prod1","Prod2","Prod3"), 
                Start_Date=c("1/1/2015", "3/1/2015", "4/1/2015", "1/1/2015", "3/1/2015", "4/1/2015", "1/1/2015", "3/1/2015", "4/1/2015"),
                End_Date=c("2/1/2015","5/1/2015","5/1/2015","2/1/2015","5/1/2015","6/1/2015","2/1/2015","6/1/2015","5/1/2015"))
   Customer Product Start_Date End_Date
1:      A01   Prod1   1/1/2015 2/1/2015
2:      A01   Prod2   3/1/2015 5/1/2015
3:      A01   Prod3   4/1/2015 5/1/2015
4:      A02   Prod1   1/1/2015 2/1/2015
5:      A02   Prod2   3/1/2015 5/1/2015
6:      A02   Prod3   4/1/2015 6/1/2015
7:      A03   Prod1   1/1/2015 2/1/2015
8:      A03   Prod2   3/1/2015 6/1/2015
9:      A03   Prod3   4/1/2015 5/1/2015

Into something like this:

Customer_Product_Combo <- data.table(Customer=c("A01","A01","A01", "A02", "A02", "A02", "A02","A03", "A03","A03","A03"),
                Product_or_Combination=c("Prod1","Prod2","Prod2/Prod3","Prod1","Prod2","Prod2/Prod3","Prod3","Prod1","Prod2","Prod2/Prod3","Prod2"),
                Start_Date=c("1/1/2015","3/1/2015","4/1/2015","1/1/2015","3/1/2015","4/1/2015","5/1/2015","1/1/2015","3/1/2015","4/1/2015","5/1/2015"),
                End_Date=c("2/1/2015","4/1/2015","5/1/2015","2/1/2015","4/1/2015","5/1/2015","6/1/2015","2/1/2015","4/1/2015","5/1/2015","6/1/2015"))
    Customer Product_or_Combination Start_Date End_Date
 1:      A01                  Prod1   1/1/2015 2/1/2015
 2:      A01                  Prod2   3/1/2015 4/1/2015
 3:      A01            Prod2/Prod3   4/1/2015 5/1/2015
 4:      A02                  Prod1   1/1/2015 2/1/2015
 5:      A02                  Prod2   3/1/2015 4/1/2015
 6:      A02            Prod2/Prod3   4/1/2015 5/1/2015
 7:      A02                  Prod3   5/1/2015 6/1/2015
 8:      A03                  Prod1   1/1/2015 2/1/2015
 9:      A03                  Prod2   3/1/2015 4/1/2015
10:      A03            Prod2/Prod3   4/1/2015 5/1/2015
11:      A03                  Prod2   5/1/2015 6/1/2015

I've been looking into IRanges, because it seems like disjoin() may be a possible solution, but I can't see any way to inherit/merge the "Prod" data.

I've also been trying to sketch out something using lead/lag in dplyr followed by a gather/merge cycle, but it's also worth noting that I could have instances where more than 2 "Prod"s overlap, and then the logic just gets messy.

Is there a reasonable way to do this? Any help is greatly appreciated!

1

There are 1 answers

1
AntoniosK On BEST ANSWER

I'm using the data you posted (as a data.frame)

Customer_Product <- data.frame(Customer=c("A01","A01","A01", "A02", "A02", "A02", "A03", "A03", "A03"), 
                               Product=c("Prod1","Prod2","Prod3","Prod1","Prod2","Prod3","Prod1","Prod2","Prod3"), 
                               Start_Date=c("1/1/2015", "3/1/2015", "4/1/2015", "1/1/2015", "3/1/2015", "4/1/2015", "1/1/2015", "3/1/2015", "4/1/2015"),
                               End_Date=c("2/1/2015","5/1/2015","5/1/2015","2/1/2015","5/1/2015","6/1/2015","2/1/2015","6/1/2015","5/1/2015"))

Here's a possible solution:

library(tidyverse)
library(data.table)
library(lubridate)

Customer_Product %>%
  mutate_at(vars(matches("Date")), dmy) %>%                          # update to date columns (if needed)
  mutate(day = map2(Start_Date, End_Date, ~seq(.x, .y, "day"))) %>%  # create sequence of days between start and end
  unnest() %>%                                                       # unnest data
  group_by(Customer, day) %>%                                        # for each customer and day
  summarise(Product = paste0(Product, collapse = "/")) %>%           # find corresponding products
  group_by(Customer, Product, id = rleid(Product)) %>%               # for each customer, product combination and position of product combination
  summarise(Start_Date = min(day),                                   # get start date
            End_Date = max(day)) %>%                                 # get end date
  ungroup() %>%                                                      # ungroup
  select(-id) %>%                                                    # remove id column
  arrange(Customer, Start_Date)                                      # order rows (if needed)


# # A tibble: 11 x 4
#   Customer Product     Start_Date End_Date  
#   <fct>    <chr>       <date>     <date>    
# 1 A01      Prod1       2015-01-01 2015-01-02
# 2 A01      Prod2       2015-01-03 2015-01-03
# 3 A01      Prod2/Prod3 2015-01-04 2015-01-05
# 4 A02      Prod1       2015-01-01 2015-01-02
# 5 A02      Prod2       2015-01-03 2015-01-03
# 6 A02      Prod2/Prod3 2015-01-04 2015-01-05
# 7 A02      Prod3       2015-01-06 2015-01-06
# 8 A03      Prod1       2015-01-01 2015-01-02
# 9 A03      Prod2       2015-01-03 2015-01-03
#10 A03      Prod2/Prod3 2015-01-04 2015-01-05
#11 A03      Prod2       2015-01-06 2015-01-06

Note that this solution doesn't allow for date range overlap in your output table.

For example, if you have Prod2/Prod3 during 4/1/2015 - 5/1/2015 you won't get Prod2 during 5/1/2015 - 6/1/2015, but 6/1/2015 - 6/1/2015, as 5/1/2015 is covered in Prod2/Prod3.