I have a data frame Data1
with three columns: NoContract
,IniDate
,FinDate
representing the identifier of a contract, when a contract begins and when finishes respectively. On the other hand I have a period of analysis: January 1, 2012 to December 31, 2014. I want to find how many contracts where active in each month of the analysis period, by active I mean that a contract has at least one day of its dates between IniDate
and FinDate
in a month of analysis period.
I tried in R doing:
Lets say Data1
is:
Data1 <- data.frame(NoContract= 1:3, IniDate= as.Date(c("2011-05-03","2012-03-13","2014-03-26")),FinDate=as.Date(c("2015-01-05","2013-03-13","2015-08-19")))
Data1
NoContract IniDate FinDate
1 1 2011-05-03 2015-01-05
2 2 2012-03-13 2013-03-13
3 3 2014-03-26 2015-08-19
I´ve created another data frame DatesCalc as:
DatesCalc<-data.frame(monthI=seq(as.Date("2012-01-01"), as.Date("2014-12-31"), by="1 month"), monthF=(seq(as.Date("2012-02-01"), as.Date("2015-01-01"), by="1 month")-1))
head(DatesCalc)
monthI monthF
1 2012-01-01 2012-01-31
2 2012-02-01 2012-02-29
3 2012-03-01 2012-03-31
4 2012-04-01 2012-04-30
5 2012-05-01 2012-05-31
6 2012-06-01 2012-06-30
Next, I wrote a function
myfun<-function(X,Y){
d1<-numeric()
d2<-numeric()
for (i in 1:36){ #36 num of rows on DatesCalc
d1<-numeric()
for (j in 1:3){ #3 num of rows of my Data1 (my actual case near 550K rows)
d1<-c(d1,sum(seq(X[i,1],X[i,2],by=1)%in%seq(Y[j,2],Y[j,3],by=1),na.rm=TRUE)>0)
}
d2<-cbind(d2,d1)
}
return(d2)
}
So what it does is, for each row of Data1
, creates a sequence of dates of each row of DatesCalc
and prove if it is within the sequence of dates of the current row of Data1
. This function returns a matrix where rows represent a contract and columns months from Jan 2012 to Dec 2014, and each cell has 1
if in a month the contract was active, and 0
if not (see Res
). Finally I used apply to sum by column and got what i want.
Res<-myfun(DatesCalc,Data1)
Res
d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1
[1,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[2,] 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
apply(Res,2,sum)
d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1
1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
The case is that I have hundreds of thousand of rows (550K) in my actual Data1
, and running myfun
on it is inefficient. My question is, maybe is a way doing this efficient in R? Or any suggest on how to improve my code. Thank you Comunnity.
Here an option using
data.table foverlaps
.foverlaps
is a merge using interval. You should have the same column names to do the merge. You should also set the keys of the second table.dcast.data.table
.The code: