Determine presence and extent of overlapping date ranges by ID number - two data frames

673 views Asked by At

I have two data frames as follows. They are of unequal length:

library(lubridate)

id <- c(1, 2, 2, 2, 2, 3, 4, 4, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9,
    10, 10, 10, 11, 11, 12, 13, 14, 15, 15, 5451396, 5451396, 5451396, 5451396, 5451396)
admDt <- ymd(c("2000-02-24", "2000-04-30", "2000-06-06", "2001-01-29", "2004-06-10", "2001-05-21",
           "2000-01-25", "2000-04-18", "2000-01-14", "1991-10-06", "1992-02-25", "2000-05-17",
           "2003-06-06", "2009-02-16", "2000-01-23", "2000-03-10", "2000-04-05", "2000-06-16",
           "2000-07-04", "2000-07-27", "2001-01-19", "2002-08-16", "2002-09-19", "2004-04-17",
           "2005-08-02", "2005-09-21", "2006-07-10", "2000-02-24", "2000-05-05", "2000-08-29",
           "2001-01-24", "2000-01-27", "2000-03-09", "2000-04-15", "2000-03-20", "2002-11-13",
           "2000-06-28", "2000-07-02", "2000-06-13", "1999-12-27", "2008-09-10", "2000-04-09",
           "2000-06-01", "2002-11-25", "2006-08-04", "2006-10-07"))
sepDt <- ymd(c("2000-02-25", "2000-05-25", "2000-06-06", "2001-02-15", "2004-07-12", "2001-06-01",
           "2000-01-31", "2000-04-20", "2000-01-31", "1991-11-07", "1992-03-26", "2000-05-31",
           "2003-06-17", "2009-02-23", "2000-03-06", "2000-03-17", "2000-04-06", "2000-06-28",
           "2000-07-17", "2000-07-31", "2002-04-19", "2002-09-11", "2003-05-06", "2004-05-03",
           "2005-08-31", "2006-05-29", "2009-06-19", "2000-03-09", "2000-05-06", "2000-09-12",
           "2001-01-24", "2000-02-15", "2000-03-17", "2000-04-16", "2000-04-20", "2002-12-05",
           "2000-07-27", "2000-08-15", "2000-06-22", "2000-02-12", "2008-09-17", "2000-05-26",
           "2000-08-29", "2003-02-24", "2006-09-22", "2006-11-10"))
adm <- data.frame(id, admDt, sepDt)

id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 5451396)
birthDt <- ymd(c("1971-07-22", "1982-08-09", "1976-01-30", "1972-02-03", "1958-05-26", "1979-05-24",
             "1971-11-03", "1980-02-05", "1978-06-08", "1969-10-14", "1962-01-01", "1977-03-09",
             "1952-01-24", "1974-12-16", "1956-05-05", "1963-07-16"))
dxDt <- ymd(c("2000-02-24", "2000-04-30", "2000-03-03", "2000-01-31", "2000-06-20", "2000-12-13",
          "2000-05-14", "2000-01-23", "2000-03-09", "2000-02-15", "2000-05-01", "2000-06-30",
          "2000-08-15", "2000-06-22", "2000-01-27", "2000-06-01"))
admPreDx <- c("No", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "Yes","Yes", "Yes", "Yes",
          "Yes", "Yes", "Yes")
admPreDxNbr <- c(0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)
admPreDxDur <- c(0, 0, 0, 6, 0, 0, 0, 0, 14, 19, 20, 2, 31, 9, 31, 25)
admPostDx <- c("Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes", "No",
           "No", "Yes", "Yes")
admPostDxNbr <- c(1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 3)
admPostDxDur <- c(1, 25, 0, 0, 0, 0, 14, 31, 0, 6, 0, 27, 0, 0, 16, 31)
admDx <- data.frame(id, birthDt, dxDt, admPreDx, admPreDxNbr, admPreDxDur, admPostDx, admPostDxNbr,
                admPostDxDur)


> head(adm)
  id      admDt      sepDt
1  1 2000-02-24 2000-02-25
2  2 2000-04-30 2000-05-25
3  2 2000-06-06 2000-06-06
4  2 2001-01-29 2001-02-15
5  2 2004-06-10 2004-07-12
6  3 2001-05-21 2001-06-01

> head(admDx)
  id    birthDt       dxDt admPreDx admPreDxNbr admPreDxDur admPostDx admPostDxNbr admPostDxDur
1  1 1971-07-22 2000-02-24       No           0           0       Yes            1            1
2  2 1982-08-09 2000-04-30       No           0           0       Yes            1           25
3  3 1976-01-30 2000-03-03       No           0           0        No            0            0
4  4 1972-02-03 2000-01-31      Yes           1           6        No            0            0
5  5 1958-05-26 2000-06-20       No           0           0        No            0            0
6  6 1979-05-24 2000-12-13       No           0           0        No            0            0

The actual datasets range from 10,000 to 1,000,000+ rows.

Each row in adm refers to a discrete hospital admission. Note: id is the patient's ID number, while admDt and sepDt refer to the admission and discharge dates, respectively. Some patient's have multiple admissions.

Each row in admDx refers to a single patient: id is the patient's ID number (consistent with that provided in adm), while birthDt and dxDt are the patient's birth and diagnosis dates, respectively.

I am conducting some longitudinal / time series analyses and would like to determine whether patients were or were not hospitalised during different time periods pre- and post- diagnosis. For the sake of brevity, this question is concerned with the one month before and after diagnosis. Ideally, I would like to:

  • Create a dichotomous variable ("Yes" / "No"), indicating whether a given patient spent time in hospital during the time period (i.e., I am not concerned if they were admitted before the onset of the time period or if they were discharged after the offset of the time period)
  • Calculate the number of times each patient was in hospital during the time period
  • Calculate the duration (number of days) each patient was in hospital during the time period

I have reviewed a number of posts over several days (e.g., R Time periods overlapping, Join dataframes by id and overlapping date range, how to show an event happened between two dates in R); however, none of them seem to combine the three aspects I am interested in (calculating time between overlapping dates; multiple data frames; by "group" [or individual]).

I am new to R and have had little experience with loops and more advanced formulae. It seems that it may be possible to use foverlaps, lubridate, or %overlaps% from the "DescTools" package; however, I am unsure how to construct the relevant formulae.

Any assistance would be greatly appreciated!

EDIT #1:

While @sirallen's suggestions worked for the specific time period in the example provided, sum(pmin(dxDt, sepDt) - pmax(admDt, dxDt)), by = "id" returned inaccurate values in my real dataset (e.g., patients with multiple admissions of duration one day ["2000-01-25" - "2000-01-26"] reportedly spent zero days in hospital. This seems strange to me, since the code seems to be be used to answer similar examples. Does this issue relate to the fact that I have several overlapping date ranges for those patients? Further, as noted by @sirallen, the code did not highlight when patients had one or more than one admission during the time period.

The code below provided a more direct route to the first two parts of my question, by determining a) whether a patient spent time in hospital and b) the number of admissions:

library(data.table)
setDT(adm)
setDT(admDx)[, (4:9) := NULL]

#Period bounds
admDx[, `:=`(dxDtN1 = dxDt %m-% months(1), dxDtP1 = dxDt %m+% months(1))]

#Hospitalised in the month preceding diagnosis
admDx <- adm[admDx, on = .(id, admDt < dxDt, sepDt > dxDtN1), .N, by = .EACHI]
admDx[, `:=` (admPreDx = factor(ifelse(N > 0, "Yes", "No")))]

However, the pmin / pmax code still does not work, returning negative values:

admDx[, `:=` (birthDt = birthDt, dxDt = dxDt, dxDtN1 = dxDt %m-% months(1), dxDtP1 = dxDt %m+% months(1))]
admDx[, `:=` (admPreDxDur=as.numeric(sum(pmin(dxDt, adm$sepDt) - pmax(dxDtN1, adm$admDt)))), by = "id"]
admDx <- select(admDx, admPreDx, N, admPreDxDur)


> head(admDx)
   admPreDx N admPreDxDur
1:       No 0      -28573
2:       No 0      -27160
3:       No 0      -28366
4:      Yes 1      -29357
5:       No 0      -26701
6:       No 0      -28044

EDIT #2

After testing additional cases, it seems that the issue re: pmin / pmax may relate to the use of > vs >=: when > is used, the correct Dur value is returned; however, when >= is used, Dur returns a value of 0.

How might this code be adapted to enable me to calculate the number of days up to, and including, the diagnosis date?

1

There are 1 answers

7
sirallen On BEST ANSWER

We can do this with non-equi joins in data.table (>=v1.9.8):

library(data.table)
setDT(adm)
setDT(admDx)[, (4:9):= NULL]

# period bounds
admDx[, `:=`(dxDtLo=dxDt-31, dxDtHi=dxDt+31)]

# hospitalized pre-dxnosis?
admDx = adm[, .(id, admDt, sepDt, dxDt=admDt, dxDtLo=sepDt)][admDx,
  on=.(id, dxDt < dxDt, dxDtLo > dxDtLo)]
admDx[, admPreDx:= as.numeric(!is.na(admDt))]
admDx[, `:=`(admPreDxNbr=sum(admPreDx), admPreDxDur=as.numeric(
  sum(pmin(dxDt,sepDt) - pmax(admDt,dxDtLo)))), by='id']
admDx[, c('admDt','sepDt'):= NULL]

# hospitalized post-dxnosis?
admDx = adm[, .(id, admDt, sepDt, dxDtHi=admDt, dxDt=sepDt)][admDx,
  on=.(id, dxDtHi < dxDtHi, dxDt > dxDt)]
admDx[, admPostDx:= as.numeric(!is.na(admDt))]
admDx[, `:=`(admPostDxNbr=sum(admPostDx), admPostDxDur=as.numeric(
  sum(pmin(sepDt,dxDtHi) - pmax(dxDt,admDt)))), by='id']
admDx[, c('admDt','sepDt'):= NULL]

admDx[is.na(admDx)] = 0
admDx = unique(admDx)[, c('dxDtLo','dxDtHi'):= NULL]

> admDx
#          id       dxDt    birthDt admPreDx admPreDxNbr admPreDxDur admPostDx admPostDxNbr admPostDxDur
#  1:       1 2000-02-24 1971-07-22        0           0           0         1            1            1
#  2:       2 2000-04-30 1982-08-09        0           0           0         1            1           25
#  3:       3 2000-03-03 1976-01-30        0           0           0         0            0            0
#  4:       4 2000-01-31 1972-02-03        1           1           6         0            0            0
#  5:       5 2000-06-20 1958-05-26        0           0           0         0            0            0
#  6:       6 2000-12-13 1979-05-24        0           0           0         0            0            0
#  7:       7 2000-05-14 1971-11-03        0           0           0         1            1           14
#  8:       8 2000-01-23 1980-02-05        0           0           0         1            1           31
#  9:       9 2000-03-09 1978-06-08        1           1          14         0            0            0
# 10:      10 2000-02-15 1969-10-14        1           1          19         1            1            8
# 11:      11 2000-05-01 1962-01-01        1           1          20         0            0            0
# 12:      12 2000-06-30 1977-03-09        1           1           2         1            1           27
# 13:      13 2000-08-15 1952-01-24        1           1          31         0            0            0
# 14:      14 2000-06-22 1974-12-16        1           1           9         0            0            0
# 15:      15 2000-01-27 1956-05-05        1           1          31         1            1           16
# 16: 5451396 2000-06-01 1963-07-16        1           1          25         1            1           31