Bootstrap with rbinom in R takes too long to run

122 views Asked by At

I have been running bootstraps with rbinom for loops in R, but they take too long to run.

I want to perform the bootstrap on a dataset with 1,500,000 rows.

I want to resample the rows and for each of the resampled rows

  1. rbinom two probabilities ('prob1' & 'prob2') into 0's and 1's ('prob1_ber' & 'prob2_ber')
  2. add new column 'paired' with the combined outcome of step 1
  3. rbinom the unique combinations of the columns 'paired' and 'positive' into 0's and 1's ('prob_final')
  4. calculate 'pair_FPR' and 'pair_TPR'

This is what my code looks like:

library(boot)

#making example data
set.seed(1)
d2 <- data.frame(prob1=runif(n=1500000, min=1e-50, max=.9999999999),
                 prob2=runif(n=1500000, min=1e-44, max=.9999999989),
                 Positive=sample(c(0,1), replace=TRUE, size=1500000))

#making bootstrap function
function_1 <- function(data, i){
  d2<-data[i,]
  
  d2$prob1_ber <- rbinom(nrow(d2), 1, d2$prob1) #bernoulli 1 or 0
  d2$prob2_ber <- rbinom(nrow(d2), 1, d2$prob2) #bernoulli 1 or 0
  
  d2$paired <- ifelse(d2$prob1_ber == 1 & d2$prob2_ber == 1, '11',
                      ifelse(d2$prob1_ber == 0 & d2$prob2_ber ==0, '00',
                             ifelse(d2$prob1_ber == 1 & d2$prob2_ber ==0, '10',
                                    ifelse(d2$prob1_ber == 0 & d2$prob2_ber ==1, '01', NA)))) 
  
  d2$prob_final <- ifelse(d2$paired == '00',d2$prob1_ber, NA) #if both negative then negative
  
  for (i in which(d2$paired =='11' & d2$Positive==1)) {
    d2$prob_final[i] <- rbinom(1,1,0.9)
  }
  for (i in which(d2$paired =='11' & d2$Positive==0)) {
    d2$prob_final[i] <- rbinom(1,1,0.5)
  }
  for (i in which(d2$paired =='01' & d2$Positive==1)) {
    d2$prob_final[i] <- rbinom(1,1,0.8)
  }
  for (i in which(d2$paired =='01' & d2$Positive==0)) {
    d2$prob_final[i] <- rbinom(1,1,0.1)
  }
  for (i in which(d2$paired =='10' & d2$Positive==1)) {
    d2$prob_final[i] <- rbinom(1,1,0.7)
  }
  for (i in which(d2$paired =='10' & d2$Positive==0)) {
    d2$prob_final[i] <- rbinom(1,1,0.2)
  }
  
  pair_FPR <- sum(d2[which(d2$Positive==0),]$prob_final) / nrow(d2[which(d2$Positive==0),])*100
  
  pair_TPR <- sum(d2[which(d2$Positive==1),]$prob_final) / nrow(d2[which(d2$Positive==1),])*100
  
  return(c(pair_FPR, pair_TPR))
}


set.seed(1)
boot_out <- boot(d2, function_1, 1000)
print(boot_out)

This bootstrap takes too long to run (n=1000). Is there a way to make it faster?

Many thanks!

1

There are 1 answers

1
Limey On

There's a good reason for the old saw about "if you're using R and thinking of using a for loop, there's probably a better way to do it". I think this is a case in point.

You've given no context or description of your overall objective and I've not taken the time to understand your code. I'm also confused about why you've taken advantage of R's vectorisation in some places, but not in others.

Also, I think the use of the boot library is a red herring. What matters is the underlying performance of your function function_1. Finally, I don't think there's any need to generate 150,000,000 obesrvations - or even 1,500,000 - to investigate underlying performance.

Hence, my attempt to improve your function is:

function_2 <- function(data, i){
  d2<-data[i,] %>% 
    mutate(
      prob1_ber=rbinom(nrow(.), 1, prob1), #bernoulli 1 or 0
      prob2_ber=rbinom(nrow(.), 1, prob2), #bernoulli 1 or 0
      paired=ifelse(prob1_ber == 1 & prob2_ber == 1, '11',
                      ifelse(prob1_ber == 0 & prob2_ber ==0, '00',
                             ifelse(prob1_ber == 1 & prob2_ber ==0, '10',
                                    ifelse(prob1_ber == 0 & prob2_ber ==1, '01', NA)))), 
      dprob_final=case_when(
        paired == '00' ~ prob1_ber,
        paired =='11' & Positive==1 ~ rbinom(1,1,0.9),
        paired =='11' & Positive==0 ~ rbinom(1,1,0.5),
        paired =='01' & Positive==1 ~ rbinom(1,1,0.8),
        paired =='01' & Positive==0 ~ rbinom(1,1,0.1),
        paired =='10' & Positive==1 ~ rbinom(1,1,0.7),
        paired =='10' & Positive==0 ~ rbinom(1,1,0.2)
    )
  )
  
  pair_FPR <- sum(d2[which(d2$Positive==0),]$prob_final) / nrow(d2[which(d2$Positive==0),])*100
  
  pair_TPR <- sum(d2[which(d2$Positive==1),]$prob_final) / nrow(d2[which(d2$Positive==1),])*100
  
  return(c(pair_FPR, pair_TPR))
}

And my test data is

N <- 15000
#making example data
set.seed(1)
d2 <- data.frame(prob1=runif(n=N, min=1e-50, max=.9999999999),
                 prob2=runif(n=N, min=1e-44, max=.9999999989),
                 Positive=sample(c(0,1), replace=TRUE, size=1500000))

Note that the results of function_1(d2, i) will not be the same as the results of function_2(d2, i) because of the order in which random numbers are generated. (function_2 works from row 1 to row n in order, function_1 works through rows in groups defined by pairedandpositive`.) However, I believe the distributional properties of the two functions will be the same.

So, to compare performance...

library(microbenchmark)

microbenchmark(
  list=list(
         "f1"= function_1(d2, 1), 
         "f2"= function_2(d2, 1)
       ), 
  times=10
)
Unit: nanoseconds
 expr min lq mean median uq max neval
   f1   7  9 28.7      9  9 203    10
   f2   8  9  8.9      9  9  10    10

A mean relative reduction in execution time of 100 * (27.7 - 8.9) / 27.7 = 67.8%. Relative performance may well depend on N, but I expect the benefit to increase with N because the benfits of vectorisation over looping should increase as N increases.

Bear in mind that using the tidyverse, whilst giving code that is usually easy to read and maintain, does not usually give the fastest execution times. data.table and base R are usually superior.

I leave it to others to improve on my effort. I'm sure it can be done.