Find overlapping dates for each ID and create a new row for the overlap

Question

Find overlapping dates for each ID and create a new row for the overlap

2.9k views Asked by sar At 11 September 2017 at 08:39

I would like to find the overlapping dates for each ID and create a new row with the overlapping dates and also combine the characters (char) for the lines. It is possible that my data will have >2 overlaps and need >2 combinations of characters. eg. ERM

Data:

ID    date1         date2       char
15  2003-04-05  2003-05-06      E
15  2003-04-20  2003-06-20      R
16  2001-01-02  2002-03-04      M
17  2003-03-05  2007-02-22      I   
17  2005-04-15  2014-05-19      C
17  2007-05-15  2008-02-05      I
17  2008-02-05  2012-02-14      M
17  2010-06-07  2011-02-14      V
17  2010-09-22  2014-05-19      P
17  2012-02-28  2013-03-04      R

Output I would like:

ID  date1       date2           char
15  2003-04-05  2003-04-20      E
15  2003-04-20  2003-05-06      ER
15  2003-05-06  2003-06-20      R
16  2001-01-02  2002-03-04      M
17  2003-03-05  2005-04-15      I
17  2005-04-15  2007-02-22      IC
17  2005-04-15  2007-05-15      C   
17  2007-05-15  2008-02-05      CI
17  2008-02-05  2012-02-14      CM
17  2010-06-07  2011-02-14      CV
17  2010-09-22  2014-05-19      CP
17  2012-02-28  2013-03-04      CR
17  2014-05-19  2014-05-19      P
17  2010-06-07  2012-02-14      MV
17  2010-09-22  2011-02-14      VP
17  2012-02-28  2013-03-04      RP

What I have tried: I have tried using subtracting date 2 from the current row from the row below using:

df$diff <- c(NA,df[2:nrow(tdf), "date1"] - df[1:(nrow(df)-1), "date2"])

Then to determine the overlaps between the rows:

df$overlap[which(df$diff<1)] <-1
df$overlap.up <- c(df$overlap[2:(nrow(df))], "NA")
df$overlap.final[which(df$overlap==1 | df$overlap.up==1)] <- 1

I then selected those that had an overlap.final==1 and put them into another dataframe and found the overlaps for each ID.

However, I have realized that this is way too simplistic and flawed, because it only selects overlaps that occur sequentially (using the difference in dates in the first step). What I need to do is to take the series of dates for each ID and loop through each combination to determine if there is an overlap and then, if so, record that start and end date and create a new character “char” signalling what was combined during those two dates. I think I need a loop to do this.

I tried to create a loop to find the overlap intervals between date1 and date 2

df <- df[which(!duplicated(df$ ID)),]

for (i in 1:nrow(df)) {
  tmp <- length(which(df $ID[i] & (df$date1[i] >df$date1 & df$date1[i]< df$date2) | (df$date2[i] < df$date2&  df$date2[i]> df$date1))) >0
  df$int[i]<- tmp

}

However this does not work.

After identifying the overlapping intervals, I need to create new rows for each new start and end date and a new character that represents the overlap.

Another version of the loop I have tried to identify overlaps:

for (i in 1:nrow(df)) {
  if (df$ID[i]==IDs$ID){
  tmp <- length(df, df$ ID[i]==IDs$ & (df$date1[i]> df$date1 & df$date1 [i]< df$date2 | df$date2[i] < df$date2 &  df$date2[i]> df$date1)) >0
  df$int[i]<- tmp
  }
}

Original Q&A

There are 3 answers

**daniel.heydebreck** · Answer 1 · 2017-09-12T08:40:26+00:00

Introduction

The for-loop you added to your question and the included comparison were a good start. The should be some additional brackets ( and ) in the date comparison. This for-loop-approach automatically considers new rows in the data frame. Therefore, you can get three-, four- and more-character strings in the char column.

Create input data

df = as.data.frame(list('ID'=c(15, 15, 16, 17, 17, 17, 17, 17, 17, 17),
                        'date1'=as.Date(c('2003-04-05', '2003-04-20', '2001-01-02', '2003-03-05', '2005-04-15', '2007-05-15', '2008-02-05', '2010-06-07', '2010-09-22', '2012-02-28')),
                        'date2'=as.Date(c('2003-05-06', '2003-06-20', '2002-03-04', '2007-02-22', '2014-05-19', '2008-02-05', '2012-02-14', '2011-02-14', '2014-05-19', '2013-03-04')),
                        'char'=c('E', 'R', 'M', 'I', 'C', 'I', 'M', 'V', 'P', 'R')),
                   stringsAsFactors=FALSE)

Solution

Iterate all rows (that were existing in the original data.frame) and compare them to all current lines.

nrow_init = nrow(df)
for (i in 1:(nrow(df)-1)) {
  print(i)
  ## get rows of df that have overlapping dates
  ##   (1:nrow(df))>i :: consider only rows below the current row to avoid double processing of two row-pairs
  ##   (!grepl(df$char[i],df$char)) :: prevent double letters
  ## Because we call nrow(df) each time (and not save it as a variable once in the beginning), we consider also new rows here. Therefore, we do not need the specific procedure for comparing 3 or more rows.
  loc = ((1:nrow(df))>i) & (!grepl(df$char[i],df$char)) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date1>df$date1[i]) & (df$date1<df$date2[i])) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)) | ((df$date2<df$date2[i]) & (df$date2>df$date1[i])))
  ## Uncomment this line, if you want to compare only two rows each and not more
  # loc = ((1:nrow(df))<=nrow_init) & ((1:nrow(df))>i) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)))

  ## proceed only of at least one duplicate row was found
  if (sum(loc) > 0) {
    # build new rows
    #  pmax and pmin do element-wise min and max calculation; df$date1[i] and df$date2[i] are automatically extended to the length of df$date1[loc] and df$date2[loc], respectively
    df_append = as.data.frame(list('ID'=df$ID[loc],
                                   'date1'=pmax(df$date1[i],df$date1[loc]),
                                   'date2'=pmin(df$date2[i],df$date2[loc]),
                                   'char'=paste0(df$char[i],df$char[loc])))
    ## append new rows
    df = rbind(df, df_append)
  }
}

## create a new column and sort the characters in it
##  idea for sort: https://stackoverflow.com/a/5904854/4612235
df$sort_char = df$char
for (i in 1:nrow(df)) df$sort_char[i] = paste(sort(unlist(strsplit(df$sort_char[i], ""))), collapse = "")
## remove duplicates
df = df[!duplicated(df[c('ID', 'date1', 'date2', 'sort_char')]),]
## remove additional column
df$sort_char = NULL

Out put

ID      date1      date2 char
15 2003-04-05 2003-05-06    E
15 2003-04-20 2003-06-20    R
16 2001-01-02 2002-03-04    M
17 2003-03-05 2007-02-22    I
17 2005-04-15 2014-05-19    C
17 2007-05-15 2008-02-05    I
17 2008-02-05 2012-02-14    M
17 2010-06-07 2011-02-14    V
17 2010-09-22 2014-05-19    P
17 2012-02-28 2013-03-04    R
15 2003-04-20 2003-05-06   ER
17 2005-04-15 2007-02-22   IC
17 2007-05-15 2008-02-05   CI
17 2008-02-05 2012-02-14   CM
17 2010-06-07 2011-02-14   CV
17 2010-09-22 2014-05-19   CP
17 2012-02-28 2013-03-04   CR
17 2010-06-07 2011-02-14   MV
17 2010-09-22 2012-02-14   MP
17 2010-06-07 2011-02-14  MCV
17 2010-09-22 2012-02-14  MCP
17 2010-09-22 2011-02-14   VP
17 2010-09-22 2011-02-14  VCP
17 2010-09-22 2011-02-14  VMP
17 2010-09-22 2011-02-14 VMCP
17 2012-02-28 2013-03-04   PR
17 2012-02-28 2013-03-04  PCR

**Uwe** · Answer 2 · 2017-09-12T09:54:35+00:00

First, we create a data.table of all possible intervals for each ID.

All possible intervals means that we take all start and end dates of an ID and combine them in a sorted vector tmp. The unique values indicate all possible intersections (or breaks) of all given intervals of the ID on the time axis. For later joining, the breaks are re-arranged in one interval per row with a start and an end column:

library(data.table)
options(datatable.print.class = TRUE)
breaks <- DT[, {
  tmp <- unique(sort(c(date1, date2)))
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
  }, by = ID]
breaks

       ID      start        end
    <int>     <IDat>     <IDat>
 1:    15 2003-04-05 2003-04-20
 2:    15 2003-04-20 2003-05-06
 3:    15 2003-05-06 2003-06-20
 4:    16 2001-01-02 2002-03-04
 5:    17 2003-03-05 2005-04-15
 6:    17 2005-04-15 2007-02-22
 7:    17 2007-02-22 2007-05-15
 8:    17 2007-05-15 2008-02-05
 9:    17 2008-02-05 2010-06-07
10:    17 2010-06-07 2010-09-22
11:    17 2010-09-22 2011-02-14
12:    17 2011-02-14 2012-02-14
13:    17 2012-02-14 2012-02-28
14:    17 2012-02-28 2013-03-04
15:    17 2013-03-04 2014-05-19

Then, a non-equi join is performed whereby the values are aggregated simultaneously on the join conditions (by = .EACHI is called grouping by each i, see this answer for a more detailed explanation):

DT[breaks, on = .(ID, date1 <= start, date2 >= end), paste(char, collapse = ""),  
   by = .EACHI, allow.cartesian = TRUE]

       ID      date1      date2     V1
    <int>     <IDat>     <IDat> <char>
 1:    15 2003-04-05 2003-04-20      E
 2:    15 2003-04-20 2003-05-06     ER
 3:    15 2003-05-06 2003-06-20      R
 4:    16 2001-01-02 2002-03-04      M
 5:    17 2003-03-05 2005-04-15      I
 6:    17 2005-04-15 2007-02-22     IC
 7:    17 2007-02-22 2007-05-15      C
 8:    17 2007-05-15 2008-02-05     CI
 9:    17 2008-02-05 2010-06-07     CM
10:    17 2010-06-07 2010-09-22    CMV
11:    17 2010-09-22 2011-02-14   CMVP
12:    17 2011-02-14 2012-02-14    CMP
13:    17 2012-02-14 2012-02-28     CP
14:    17 2012-02-28 2013-03-04    CPR
15:    17 2013-03-04 2014-05-19     CP

The result differs from the expected result posted by the OP but plotting the data convinces that the above result shows all possible overlaps:

library(ggplot2)
ggplot(DT) + aes(y = char, yend = char, x = date1, xend = date2) + 
  geom_segment() + facet_wrap("ID", ncol = 1L)

Data

library(data.table)
DT <- fread(
  "ID    date1         date2       char
15  2003-04-05  2003-05-06      E
15  2003-04-20  2003-06-20      R
16  2001-01-02  2002-03-04      M
17  2003-03-05  2007-02-22      I   
17  2005-04-15  2014-05-19      C
17  2007-05-15  2008-02-05      I
17  2008-02-05  2012-02-14      M
17  2010-06-07  2011-02-14      V
17  2010-09-22  2014-05-19      P
17  2012-02-28  2013-03-04      R"
)
cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]

**Davis Vaughan** · Answer 3 · 2022-04-05T14:36:33+00:00

You can also use dplyr/tidyr along with the ivs package, which is a package dedicated to working with interval vectors like you have here. This allows you to combine your start/end dates into a single interval column and use a variety of iv_*() functions on it, here we use iv_identify_splits().

Understanding iv_identify_splits() can be a little tricky at first, so I'd encourage you to take a look at the graphical representation of that operation here

library(dplyr)
library(tidyr)
library(ivs)

df <- tribble(
  ~ID,       ~date1,       ~date2, ~char,
  15L, "2003-04-05", "2003-05-06",   "E",
  15L, "2003-04-20", "2003-06-20",   "R",
  16L, "2001-01-02", "2002-03-04",   "M",
  17L, "2003-03-05", "2007-02-22",   "I",
  17L, "2005-04-15", "2014-05-19",   "C",
  17L, "2007-05-15", "2008-02-05",   "I",
  17L, "2008-02-05", "2012-02-14",   "M",
  17L, "2010-06-07", "2011-02-14",   "V",
  17L, "2010-09-22", "2014-05-19",   "P",
  17L, "2012-02-28", "2013-03-04",   "R"
) %>%
  mutate(
    date1 = as.Date(date1),
    date2 = as.Date(date2)
  )

# Combine the start/stop endpoints into a single interval vector
df <- df %>%
  mutate(interval = iv(date1, date2), .keep = "unused")

# Note that these are half-open intervals and you may need to adjust the end!
df
#> # A tibble: 10 × 3
#>       ID char                  interval
#>    <int> <chr>               <iv<date>>
#>  1    15 E     [2003-04-05, 2003-05-06)
#>  2    15 R     [2003-04-20, 2003-06-20)
#>  3    16 M     [2001-01-02, 2002-03-04)
#>  4    17 I     [2003-03-05, 2007-02-22)
#>  5    17 C     [2005-04-15, 2014-05-19)
#>  6    17 I     [2007-05-15, 2008-02-05)
#>  7    17 M     [2008-02-05, 2012-02-14)
#>  8    17 V     [2010-06-07, 2011-02-14)
#>  9    17 P     [2010-09-22, 2014-05-19)
#> 10    17 R     [2012-02-28, 2013-03-04)

# For each ID, compute the "splits" for each interval.
# This splits on all the endpoints and returns a list column
df <- df %>%
  group_by(ID) %>%
  mutate(splits = iv_identify_splits(interval))

print(df, n = 3)
#> # A tibble: 10 × 4
#> # Groups:   ID [3]
#>      ID char                  interval           splits
#>   <int> <chr>               <iv<date>> <list<iv<date>>>
#> 1    15 E     [2003-04-05, 2003-05-06)              [2]
#> 2    15 R     [2003-04-20, 2003-06-20)              [2]
#> 3    16 M     [2001-01-02, 2002-03-04)              [1]
#> # … with 7 more rows

# Note how the total range of the splits vector matches the
# range of the corresponding interval
df$interval[[1]]
#> <iv<date>[1]>
#> [1] [2003-04-05, 2003-05-06)
df$splits[[1]]
#> <iv<date>[2]>
#> [1] [2003-04-05, 2003-04-20) [2003-04-20, 2003-05-06)

# From there we can unchop() the splits column so we can group on it
df <- df %>%
  unchop(splits)

# Note how rows 2 and 3 have the same `splits` value, so `E` and `R` will
# go together
df
#> # A tibble: 30 × 4
#> # Groups:   ID [3]
#>       ID char                  interval                   splits
#>    <int> <chr>               <iv<date>>               <iv<date>>
#>  1    15 E     [2003-04-05, 2003-05-06) [2003-04-05, 2003-04-20)
#>  2    15 E     [2003-04-05, 2003-05-06) [2003-04-20, 2003-05-06)
#>  3    15 R     [2003-04-20, 2003-06-20) [2003-04-20, 2003-05-06)
#>  4    15 R     [2003-04-20, 2003-06-20) [2003-05-06, 2003-06-20)
#>  5    16 M     [2001-01-02, 2002-03-04) [2001-01-02, 2002-03-04)
#>  6    17 I     [2003-03-05, 2007-02-22) [2003-03-05, 2005-04-15)
#>  7    17 I     [2003-03-05, 2007-02-22) [2005-04-15, 2007-02-22)
#>  8    17 C     [2005-04-15, 2014-05-19) [2005-04-15, 2007-02-22)
#>  9    17 C     [2005-04-15, 2014-05-19) [2007-02-22, 2007-05-15)
#> 10    17 C     [2005-04-15, 2014-05-19) [2007-05-15, 2008-02-05)
#> # … with 20 more rows

# Group by (ID, splits) and paste the `char` column elements together
df %>%
  group_by(ID, splits) %>%
  summarise(char = paste0(char, collapse = ","), .groups = "drop")
#> # A tibble: 15 × 3
#>       ID                   splits char   
#>    <int>               <iv<date>> <chr>  
#>  1    15 [2003-04-05, 2003-04-20) E      
#>  2    15 [2003-04-20, 2003-05-06) E,R    
#>  3    15 [2003-05-06, 2003-06-20) R      
#>  4    16 [2001-01-02, 2002-03-04) M      
#>  5    17 [2003-03-05, 2005-04-15) I      
#>  6    17 [2005-04-15, 2007-02-22) I,C    
#>  7    17 [2007-02-22, 2007-05-15) C      
#>  8    17 [2007-05-15, 2008-02-05) C,I    
#>  9    17 [2008-02-05, 2010-06-07) C,M    
#> 10    17 [2010-06-07, 2010-09-22) C,M,V  
#> 11    17 [2010-09-22, 2011-02-14) C,M,V,P
#> 12    17 [2011-02-14, 2012-02-14) C,M,P  
#> 13    17 [2012-02-14, 2012-02-28) C,P    
#> 14    17 [2012-02-28, 2013-03-04) C,P,R  
#> 15    17 [2013-03-04, 2014-05-19) C,P

^{Created on 2022-04-05 by the reprex package (v2.0.1)}

TechQA.

Find overlapping dates for each ID and create a new row for the overlap

There are 3 answers

Introduction

Create input data

Solution

Out put

Data

Related Questions in R

Related Questions in DATE

Related Questions in OVERLAP

Related Questions in OVERLAPPING-MATCHES

Popular Questions

Popular Tags

Trending Questions