Range of dates in a period of analysis in R

1.3k views Asked by At

I have a data frame Data1 with three columns: NoContract,IniDate,FinDate representing the identifier of a contract, when a contract begins and when finishes respectively. On the other hand I have a period of analysis: January 1, 2012 to December 31, 2014. I want to find how many contracts where active in each month of the analysis period, by active I mean that a contract has at least one day of its dates between IniDate and FinDate in a month of analysis period.

I tried in R doing:

Lets say Data1 is:

Data1 <- data.frame(NoContract= 1:3, IniDate= as.Date(c("2011-05-03","2012-03-13","2014-03-26")),FinDate=as.Date(c("2015-01-05","2013-03-13","2015-08-19")))
Data1

  NoContract    IniDate    FinDate
1          1 2011-05-03 2015-01-05
2          2 2012-03-13 2013-03-13
3          3 2014-03-26 2015-08-19

I´ve created another data frame DatesCalc as:

DatesCalc<-data.frame(monthI=seq(as.Date("2012-01-01"), as.Date("2014-12-31"), by="1 month"), monthF=(seq(as.Date("2012-02-01"), as.Date("2015-01-01"), by="1 month")-1))
head(DatesCalc)

      monthI     monthF
1 2012-01-01 2012-01-31
2 2012-02-01 2012-02-29
3 2012-03-01 2012-03-31
4 2012-04-01 2012-04-30
5 2012-05-01 2012-05-31
6 2012-06-01 2012-06-30

Next, I wrote a function

myfun<-function(X,Y){
  d1<-numeric()
  d2<-numeric()
  for (i in 1:36){ #36 num of rows on DatesCalc
    d1<-numeric()
    for (j in 1:3){ #3 num of rows of my Data1 (my actual case near 550K rows)
      d1<-c(d1,sum(seq(X[i,1],X[i,2],by=1)%in%seq(Y[j,2],Y[j,3],by=1),na.rm=TRUE)>0)
    }
d2<-cbind(d2,d1)
  }
  return(d2)
}

So what it does is, for each row of Data1, creates a sequence of dates of each row of DatesCalcand prove if it is within the sequence of dates of the current row of Data1. This function returns a matrix where rows represent a contract and columns months from Jan 2012 to Dec 2014, and each cell has 1 if in a month the contract was active, and 0 if not (see Res). Finally I used apply to sum by column and got what i want.

Res<-myfun(DatesCalc,Data1)
Res
     d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1
[1,]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[2,]  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[3,]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1

apply(Res,2,sum)
d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 
 1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2

The case is that I have hundreds of thousand of rows (550K) in my actual Data1, and running myfun on it is inefficient. My question is, maybe is a way doing this efficient in R? Or any suggest on how to improve my code. Thank you Comunnity.

1

There are 1 answers

4
agstudy On BEST ANSWER

Here an option using data.table foverlaps.

  1. First , foverlaps is a merge using interval. You should have the same column names to do the merge. You should also set the keys of the second table.
  2. LThe desired output is a matrix where rows represent a contract and columns months from Jan 2012 to Dec 2014, so I cretaed a new column period which is the year-month of the contract.
  3. Reshape the result in the wide format using dcast.data.table .

The code:

library(data.table)
setDT(Data1)
setDT(DatesCalc)
setkey(Data1, IniDate, FinDate)   ## Set keys for merge 
setnames(DatesCalc,names(DatesCalc),c('IniDate','FinDate')) ## rename for merge
dcast.data.table(        ## wide format
  foverlaps(DatesCalc, Data1, type="within")[,
        period := format(i.IniDate,'%Y-%m')], ## create a new variable here
  NoContract~period,fun=length) ## the aggregate function is the length (T/F)

  NoContract 2012-01 2012-02 2012-03 2012-04 2012-05 2012-06 2012-07 2012-08 2012-09 2012-10 2012-11 2012-12 2013-01 2013-02 2013-03 2013-04 2013-05 2013-06 2013-07
1:          1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
2:          2       0       0       0       1       1       1       1       1       1       1       1       1       1       1       0       0       0       0       0
3:          3       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
   2013-08 2013-09 2013-10 2013-11 2013-12 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 2014-08 2014-09 2014-10 2014-11 2014-12
1:       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
2:       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
3:       0       0       0       0       0       0       0       0       1       1       1       1       1       1       1       1       1