I have been running bootstraps with rbinom for loops in R, but they take too long to run.
I want to perform the bootstrap on a dataset with 1,500,000 rows.
I want to resample the rows and for each of the resampled rows
- rbinom two probabilities ('prob1' & 'prob2') into 0's and 1's ('prob1_ber' & 'prob2_ber')
- add new column 'paired' with the combined outcome of step 1
- rbinom the unique combinations of the columns 'paired' and 'positive' into 0's and 1's ('prob_final')
- calculate 'pair_FPR' and 'pair_TPR'
This is what my code looks like:
library(boot)
#making example data
set.seed(1)
d2 <- data.frame(prob1=runif(n=1500000, min=1e-50, max=.9999999999),
prob2=runif(n=1500000, min=1e-44, max=.9999999989),
Positive=sample(c(0,1), replace=TRUE, size=1500000))
#making bootstrap function
function_1 <- function(data, i){
d2<-data[i,]
d2$prob1_ber <- rbinom(nrow(d2), 1, d2$prob1) #bernoulli 1 or 0
d2$prob2_ber <- rbinom(nrow(d2), 1, d2$prob2) #bernoulli 1 or 0
d2$paired <- ifelse(d2$prob1_ber == 1 & d2$prob2_ber == 1, '11',
ifelse(d2$prob1_ber == 0 & d2$prob2_ber ==0, '00',
ifelse(d2$prob1_ber == 1 & d2$prob2_ber ==0, '10',
ifelse(d2$prob1_ber == 0 & d2$prob2_ber ==1, '01', NA))))
d2$prob_final <- ifelse(d2$paired == '00',d2$prob1_ber, NA) #if both negative then negative
for (i in which(d2$paired =='11' & d2$Positive==1)) {
d2$prob_final[i] <- rbinom(1,1,0.9)
}
for (i in which(d2$paired =='11' & d2$Positive==0)) {
d2$prob_final[i] <- rbinom(1,1,0.5)
}
for (i in which(d2$paired =='01' & d2$Positive==1)) {
d2$prob_final[i] <- rbinom(1,1,0.8)
}
for (i in which(d2$paired =='01' & d2$Positive==0)) {
d2$prob_final[i] <- rbinom(1,1,0.1)
}
for (i in which(d2$paired =='10' & d2$Positive==1)) {
d2$prob_final[i] <- rbinom(1,1,0.7)
}
for (i in which(d2$paired =='10' & d2$Positive==0)) {
d2$prob_final[i] <- rbinom(1,1,0.2)
}
pair_FPR <- sum(d2[which(d2$Positive==0),]$prob_final) / nrow(d2[which(d2$Positive==0),])*100
pair_TPR <- sum(d2[which(d2$Positive==1),]$prob_final) / nrow(d2[which(d2$Positive==1),])*100
return(c(pair_FPR, pair_TPR))
}
set.seed(1)
boot_out <- boot(d2, function_1, 1000)
print(boot_out)
This bootstrap takes too long to run (n=1000). Is there a way to make it faster?
Many thanks!
There's a good reason for the old saw about "if you're using R and thinking of using a
for
loop, there's probably a better way to do it". I think this is a case in point.You've given no context or description of your overall objective and I've not taken the time to understand your code. I'm also confused about why you've taken advantage of R's vectorisation in some places, but not in others.
Also, I think the use of the
boot
library is a red herring. What matters is the underlying performance of your functionfunction_1
. Finally, I don't think there's any need to generate 150,000,000 obesrvations - or even 1,500,000 - to investigate underlying performance.Hence, my attempt to improve your function is:
And my test data is
Note that the results of
function_1(d2, i)
will not be the same as the results offunction_2(d2, i)
because of the order in which random numbers are generated. (function_2
works from row 1 to row n in order,function_1 works through rows in groups defined by
pairedand
positive`.) However, I believe the distributional properties of the two functions will be the same.So, to compare performance...
A mean relative reduction in execution time of 100 * (27.7 - 8.9) / 27.7 = 67.8%. Relative performance may well depend on
N
, but I expect the benefit to increase withN
because the benfits of vectorisation over looping should increase asN
increases.Bear in mind that using the tidyverse, whilst giving code that is usually easy to read and maintain, does not usually give the fastest execution times.
data.table
and base R are usually superior.I leave it to others to improve on my effort. I'm sure it can be done.