I have a very large dataset and I would like to perform the following computation in R using data.table:
library(data.table)
# This is a test dataset
tempData <- data.table(
  drugName = rep("Aspirine", times = 4),
  startdt = c("2012-01-01",
              "2012-01-20",
              "2012-02-15",
              "2012-03-10"),
  daysupp = c(30, 30, 10, 20))
# An example of the desired computation
tempData[, startdt := as.Date(startdt)]
tempData[1, enddt := startdt + daysupp]
for (i in 2:nrow(tempData)) {
  if (tempData[i, startdt] >= tempData[i - 1, enddt]) {
    tempData[i, enddt := startdt + daysupp]
  } else {
    tempData[i, enddt := tempData[i - 1, enddt] + daysupp]
  }
}
This computation has to be done separately for each drug name, so I could wrap the for loop in a function and apply it to my DT with a group by on brandname. However, this takes a lot of time. I am wondering whether there is a way to sequentially update the data.table rows using a vectorized approach.
I was thinking of using shift, but I cannot find a way to update the enddt variable sequentially while following these two if statements.
This is really a general question about how to make this type of sequential computation fast.
I'd write a simple Rcpp function instead of spending time trying to find a vectorized R solution:
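A minimal sketch of what such a function could look like, assuming the rule in the loop reduces to enddt[i] = max(startdt[i], enddt[i-1]) + daysupp[i] and that rows are sorted by startdt within each drug. The name rollEndDate is hypothetical, and this is one possible implementation rather than a definitive one:

```r
library(data.table)
library(Rcpp)

# Hypothetical helper (name not from the original post): a single pass over
# one drug's rows, carrying the previous end date forward.
cppFunction('
NumericVector rollEndDate(NumericVector start, NumericVector supply) {
  int n = start.size();
  NumericVector end(n);
  if (n == 0) return end;
  end[0] = start[0] + supply[0];
  for (int i = 1; i < n; ++i) {
    // Count from the later of the current start date and the previous end date
    double base = (start[i] >= end[i - 1]) ? start[i] : end[i - 1];
    end[i] = base + supply[i];
  }
  return end;
}')

tempData <- data.table(
  drugName = rep("Aspirine", times = 4),
  startdt  = as.Date(c("2012-01-01", "2012-01-20", "2012-02-15", "2012-03-10")),
  daysupp  = c(30, 30, 10, 20)
)

# Assumption: each group must be ordered by startdt before the pass
setorder(tempData, drugName, startdt)

# Dates are passed as day counts; the Date class is restored afterwards
tempData[, enddt := as.Date(rollEndDate(as.numeric(startdt), daysupp),
                            origin = "1970-01-01"),
         by = drugName]
```

The loop itself stays sequential, but it runs in compiled code once per group, so the per-row overhead of indexing the data.table from R goes away. Grouping here uses drugName from the test data; substitute whatever column identifies the drug in the real table.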