Find overlapping dates for each ID and create a new row for the overlap

2.9k views Asked by At

I would like to find the overlapping dates for each ID and create a new row with the overlapping dates and also combine the characters (char) for the lines. It is possible that my data will have >2 overlaps and need >2 combinations of characters. eg. ERM

Data:

ID    date1         date2       char
15  2003-04-05  2003-05-06      E
15  2003-04-20  2003-06-20      R
16  2001-01-02  2002-03-04      M
17  2003-03-05  2007-02-22      I   
17  2005-04-15  2014-05-19      C
17  2007-05-15  2008-02-05      I
17  2008-02-05  2012-02-14      M
17  2010-06-07  2011-02-14      V
17  2010-09-22  2014-05-19      P
17  2012-02-28  2013-03-04      R

Output I would like:

ID  date1       date2           char
15  2003-04-05  2003-04-20      E
15  2003-04-20  2003-05-06      ER
15  2003-05-06  2003-06-20      R
16  2001-01-02  2002-03-04      M
17  2003-03-05  2005-04-15      I
17  2005-04-15  2007-02-22      IC
17  2005-04-15  2007-05-15      C   
17  2007-05-15  2008-02-05      CI
17  2008-02-05  2012-02-14      CM
17  2010-06-07  2011-02-14      CV
17  2010-09-22  2014-05-19      CP
17  2012-02-28  2013-03-04      CR
17  2014-05-19  2014-05-19      P
17  2010-06-07  2012-02-14      MV
17  2010-09-22  2011-02-14      VP
17  2012-02-28  2013-03-04      RP

What I have tried: I have tried using subtracting date 2 from the current row from the row below using:

df$diff <- c(NA,df[2:nrow(tdf), "date1"] - df[1:(nrow(df)-1), "date2"])

Then to determine the overlaps between the rows:

df$overlap[which(df$diff<1)] <-1
df$overlap.up <- c(df$overlap[2:(nrow(df))], "NA")
df$overlap.final[which(df$overlap==1 | df$overlap.up==1)] <- 1

I then selected those that had an overlap.final==1 and put them into another dataframe and found the overlaps for each ID.

However, I have realized that this is way too simplistic and flawed, because it only selects overlaps that occur sequentially (using the difference in dates in the first step). What I need to do is to take the series of dates for each ID and loop through each combination to determine if there is an overlap and then, if so, record that start and end date and create a new character “char” signalling what was combined during those two dates. I think I need a loop to do this.

I tried to create a loop to find the overlap intervals between date1 and date 2

df <- df[which(!duplicated(df$ ID)),]

for (i in 1:nrow(df)) {
  tmp <- length(which(df $ID[i] & (df$date1[i] >df$date1 & df$date1[i]< df$date2) | (df$date2[i] < df$date2&  df$date2[i]> df$date1))) >0
  df$int[i]<- tmp

}

However this does not work.

After identifying the overlapping intervals, I need to create new rows for each new start and end date and a new character that represents the overlap.

Another version of the loop I have tried to identify overlaps:

for (i in 1:nrow(df)) {
  if (df$ID[i]==IDs$ID){
  tmp <- length(df, df$ ID[i]==IDs$ & (df$date1[i]> df$date1 & df$date1 [i]< df$date2 | df$date2[i] < df$date2 &  df$date2[i]> df$date1)) >0
  df$int[i]<- tmp
  }
}
3

There are 3 answers

1
daniel.heydebreck On

Introduction

The for-loop you added to your question and the included comparison were a good start. The should be some additional brackets ( and ) in the date comparison. This for-loop-approach automatically considers new rows in the data frame. Therefore, you can get three-, four- and more-character strings in the char column.

Create input data

df = as.data.frame(list('ID'=c(15, 15, 16, 17, 17, 17, 17, 17, 17, 17),
                        'date1'=as.Date(c('2003-04-05', '2003-04-20', '2001-01-02', '2003-03-05', '2005-04-15', '2007-05-15', '2008-02-05', '2010-06-07', '2010-09-22', '2012-02-28')),
                        'date2'=as.Date(c('2003-05-06', '2003-06-20', '2002-03-04', '2007-02-22', '2014-05-19', '2008-02-05', '2012-02-14', '2011-02-14', '2014-05-19', '2013-03-04')),
                        'char'=c('E', 'R', 'M', 'I', 'C', 'I', 'M', 'V', 'P', 'R')),
                   stringsAsFactors=FALSE)

Solution

Iterate all rows (that were existing in the original data.frame) and compare them to all current lines.

nrow_init = nrow(df)
for (i in 1:(nrow(df)-1)) {
  print(i)
  ## get rows of df that have overlapping dates
  ##   (1:nrow(df))>i :: consider only rows below the current row to avoid double processing of two row-pairs
  ##   (!grepl(df$char[i],df$char)) :: prevent double letters
  ## Because we call nrow(df) each time (and not save it as a variable once in the beginning), we consider also new rows here. Therefore, we do not need the specific procedure for comparing 3 or more rows.
  loc = ((1:nrow(df))>i) & (!grepl(df$char[i],df$char)) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date1>df$date1[i]) & (df$date1<df$date2[i])) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)) | ((df$date2<df$date2[i]) & (df$date2>df$date1[i])))
  ## Uncomment this line, if you want to compare only two rows each and not more
  # loc = ((1:nrow(df))<=nrow_init) & ((1:nrow(df))>i) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)))

  ## proceed only of at least one duplicate row was found
  if (sum(loc) > 0) {
    # build new rows
    #  pmax and pmin do element-wise min and max calculation; df$date1[i] and df$date2[i] are automatically extended to the length of df$date1[loc] and df$date2[loc], respectively
    df_append = as.data.frame(list('ID'=df$ID[loc],
                                   'date1'=pmax(df$date1[i],df$date1[loc]),
                                   'date2'=pmin(df$date2[i],df$date2[loc]),
                                   'char'=paste0(df$char[i],df$char[loc])))
    ## append new rows
    df = rbind(df, df_append)
  }
}

## create a new column and sort the characters in it
##  idea for sort: https://stackoverflow.com/a/5904854/4612235
df$sort_char = df$char
for (i in 1:nrow(df)) df$sort_char[i] = paste(sort(unlist(strsplit(df$sort_char[i], ""))), collapse = "")
## remove duplicates
df = df[!duplicated(df[c('ID', 'date1', 'date2', 'sort_char')]),]
## remove additional column
df$sort_char = NULL

Out put

ID      date1      date2 char
15 2003-04-05 2003-05-06    E
15 2003-04-20 2003-06-20    R
16 2001-01-02 2002-03-04    M
17 2003-03-05 2007-02-22    I
17 2005-04-15 2014-05-19    C
17 2007-05-15 2008-02-05    I
17 2008-02-05 2012-02-14    M
17 2010-06-07 2011-02-14    V
17 2010-09-22 2014-05-19    P
17 2012-02-28 2013-03-04    R
15 2003-04-20 2003-05-06   ER
17 2005-04-15 2007-02-22   IC
17 2007-05-15 2008-02-05   CI
17 2008-02-05 2012-02-14   CM
17 2010-06-07 2011-02-14   CV
17 2010-09-22 2014-05-19   CP
17 2012-02-28 2013-03-04   CR
17 2010-06-07 2011-02-14   MV
17 2010-09-22 2012-02-14   MP
17 2010-06-07 2011-02-14  MCV
17 2010-09-22 2012-02-14  MCP
17 2010-09-22 2011-02-14   VP
17 2010-09-22 2011-02-14  VCP
17 2010-09-22 2011-02-14  VMP
17 2010-09-22 2011-02-14 VMCP
17 2012-02-28 2013-03-04   PR
17 2012-02-28 2013-03-04  PCR
4
Uwe On

First, we create a data.table of all possible intervals for each ID.

All possible intervals means that we take all start and end dates of an ID and combine them in a sorted vector tmp. The unique values indicate all possible intersections (or breaks) of all given intervals of the ID on the time axis. For later joining, the breaks are re-arranged in one interval per row with a start and an end column:

library(data.table)
options(datatable.print.class = TRUE)
breaks <- DT[, {
  tmp <- unique(sort(c(date1, date2)))
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
  }, by = ID]
breaks
       ID      start        end
    <int>     <IDat>     <IDat>
 1:    15 2003-04-05 2003-04-20
 2:    15 2003-04-20 2003-05-06
 3:    15 2003-05-06 2003-06-20
 4:    16 2001-01-02 2002-03-04
 5:    17 2003-03-05 2005-04-15
 6:    17 2005-04-15 2007-02-22
 7:    17 2007-02-22 2007-05-15
 8:    17 2007-05-15 2008-02-05
 9:    17 2008-02-05 2010-06-07
10:    17 2010-06-07 2010-09-22
11:    17 2010-09-22 2011-02-14
12:    17 2011-02-14 2012-02-14
13:    17 2012-02-14 2012-02-28
14:    17 2012-02-28 2013-03-04
15:    17 2013-03-04 2014-05-19

Then, a non-equi join is performed whereby the values are aggregated simultaneously on the join conditions (by = .EACHI is called grouping by each i, see this answer for a more detailed explanation):

DT[breaks, on = .(ID, date1 <= start, date2 >= end), paste(char, collapse = ""),  
   by = .EACHI, allow.cartesian = TRUE]
       ID      date1      date2     V1
    <int>     <IDat>     <IDat> <char>
 1:    15 2003-04-05 2003-04-20      E
 2:    15 2003-04-20 2003-05-06     ER
 3:    15 2003-05-06 2003-06-20      R
 4:    16 2001-01-02 2002-03-04      M
 5:    17 2003-03-05 2005-04-15      I
 6:    17 2005-04-15 2007-02-22     IC
 7:    17 2007-02-22 2007-05-15      C
 8:    17 2007-05-15 2008-02-05     CI
 9:    17 2008-02-05 2010-06-07     CM
10:    17 2010-06-07 2010-09-22    CMV
11:    17 2010-09-22 2011-02-14   CMVP
12:    17 2011-02-14 2012-02-14    CMP
13:    17 2012-02-14 2012-02-28     CP
14:    17 2012-02-28 2013-03-04    CPR
15:    17 2013-03-04 2014-05-19     CP

The result differs from the expected result posted by the OP but plotting the data convinces that the above result shows all possible overlaps:

library(ggplot2)
ggplot(DT) + aes(y = char, yend = char, x = date1, xend = date2) + 
  geom_segment() + facet_wrap("ID", ncol = 1L) 

enter image description here

Data

library(data.table)
DT <- fread(
  "ID    date1         date2       char
15  2003-04-05  2003-05-06      E
15  2003-04-20  2003-06-20      R
16  2001-01-02  2002-03-04      M
17  2003-03-05  2007-02-22      I   
17  2005-04-15  2014-05-19      C
17  2007-05-15  2008-02-05      I
17  2008-02-05  2012-02-14      M
17  2010-06-07  2011-02-14      V
17  2010-09-22  2014-05-19      P
17  2012-02-28  2013-03-04      R"
)
cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
0
Davis Vaughan On

You can also use dplyr/tidyr along with the ivs package, which is a package dedicated to working with interval vectors like you have here. This allows you to combine your start/end dates into a single interval column and use a variety of iv_*() functions on it, here we use iv_identify_splits().

Understanding iv_identify_splits() can be a little tricky at first, so I'd encourage you to take a look at the graphical representation of that operation here

library(dplyr)
library(tidyr)
library(ivs)

df <- tribble(
  ~ID,       ~date1,       ~date2, ~char,
  15L, "2003-04-05", "2003-05-06",   "E",
  15L, "2003-04-20", "2003-06-20",   "R",
  16L, "2001-01-02", "2002-03-04",   "M",
  17L, "2003-03-05", "2007-02-22",   "I",
  17L, "2005-04-15", "2014-05-19",   "C",
  17L, "2007-05-15", "2008-02-05",   "I",
  17L, "2008-02-05", "2012-02-14",   "M",
  17L, "2010-06-07", "2011-02-14",   "V",
  17L, "2010-09-22", "2014-05-19",   "P",
  17L, "2012-02-28", "2013-03-04",   "R"
) %>%
  mutate(
    date1 = as.Date(date1),
    date2 = as.Date(date2)
  )

# Combine the start/stop endpoints into a single interval vector
df <- df %>%
  mutate(interval = iv(date1, date2), .keep = "unused")

# Note that these are half-open intervals and you may need to adjust the end!
df
#> # A tibble: 10 × 3
#>       ID char                  interval
#>    <int> <chr>               <iv<date>>
#>  1    15 E     [2003-04-05, 2003-05-06)
#>  2    15 R     [2003-04-20, 2003-06-20)
#>  3    16 M     [2001-01-02, 2002-03-04)
#>  4    17 I     [2003-03-05, 2007-02-22)
#>  5    17 C     [2005-04-15, 2014-05-19)
#>  6    17 I     [2007-05-15, 2008-02-05)
#>  7    17 M     [2008-02-05, 2012-02-14)
#>  8    17 V     [2010-06-07, 2011-02-14)
#>  9    17 P     [2010-09-22, 2014-05-19)
#> 10    17 R     [2012-02-28, 2013-03-04)

# For each ID, compute the "splits" for each interval.
# This splits on all the endpoints and returns a list column
df <- df %>%
  group_by(ID) %>%
  mutate(splits = iv_identify_splits(interval))

print(df, n = 3)
#> # A tibble: 10 × 4
#> # Groups:   ID [3]
#>      ID char                  interval           splits
#>   <int> <chr>               <iv<date>> <list<iv<date>>>
#> 1    15 E     [2003-04-05, 2003-05-06)              [2]
#> 2    15 R     [2003-04-20, 2003-06-20)              [2]
#> 3    16 M     [2001-01-02, 2002-03-04)              [1]
#> # … with 7 more rows

# Note how the total range of the splits vector matches the
# range of the corresponding interval
df$interval[[1]]
#> <iv<date>[1]>
#> [1] [2003-04-05, 2003-05-06)
df$splits[[1]]
#> <iv<date>[2]>
#> [1] [2003-04-05, 2003-04-20) [2003-04-20, 2003-05-06)

# From there we can unchop() the splits column so we can group on it
df <- df %>%
  unchop(splits)

# Note how rows 2 and 3 have the same `splits` value, so `E` and `R` will
# go together
df
#> # A tibble: 30 × 4
#> # Groups:   ID [3]
#>       ID char                  interval                   splits
#>    <int> <chr>               <iv<date>>               <iv<date>>
#>  1    15 E     [2003-04-05, 2003-05-06) [2003-04-05, 2003-04-20)
#>  2    15 E     [2003-04-05, 2003-05-06) [2003-04-20, 2003-05-06)
#>  3    15 R     [2003-04-20, 2003-06-20) [2003-04-20, 2003-05-06)
#>  4    15 R     [2003-04-20, 2003-06-20) [2003-05-06, 2003-06-20)
#>  5    16 M     [2001-01-02, 2002-03-04) [2001-01-02, 2002-03-04)
#>  6    17 I     [2003-03-05, 2007-02-22) [2003-03-05, 2005-04-15)
#>  7    17 I     [2003-03-05, 2007-02-22) [2005-04-15, 2007-02-22)
#>  8    17 C     [2005-04-15, 2014-05-19) [2005-04-15, 2007-02-22)
#>  9    17 C     [2005-04-15, 2014-05-19) [2007-02-22, 2007-05-15)
#> 10    17 C     [2005-04-15, 2014-05-19) [2007-05-15, 2008-02-05)
#> # … with 20 more rows

# Group by (ID, splits) and paste the `char` column elements together
df %>%
  group_by(ID, splits) %>%
  summarise(char = paste0(char, collapse = ","), .groups = "drop")
#> # A tibble: 15 × 3
#>       ID                   splits char   
#>    <int>               <iv<date>> <chr>  
#>  1    15 [2003-04-05, 2003-04-20) E      
#>  2    15 [2003-04-20, 2003-05-06) E,R    
#>  3    15 [2003-05-06, 2003-06-20) R      
#>  4    16 [2001-01-02, 2002-03-04) M      
#>  5    17 [2003-03-05, 2005-04-15) I      
#>  6    17 [2005-04-15, 2007-02-22) I,C    
#>  7    17 [2007-02-22, 2007-05-15) C      
#>  8    17 [2007-05-15, 2008-02-05) C,I    
#>  9    17 [2008-02-05, 2010-06-07) C,M    
#> 10    17 [2010-06-07, 2010-09-22) C,M,V  
#> 11    17 [2010-09-22, 2011-02-14) C,M,V,P
#> 12    17 [2011-02-14, 2012-02-14) C,M,P  
#> 13    17 [2012-02-14, 2012-02-28) C,P    
#> 14    17 [2012-02-28, 2013-03-04) C,P,R  
#> 15    17 [2013-03-04, 2014-05-19) C,P

Created on 2022-04-05 by the reprex package (v2.0.1)