R code: Is there a way to make this Monte Carlo simulation quicker?


Imagine I hand you a ping pong ball with a "-1" printed on it. I then tell you to draw another ping pong ball from a bag marked "First Bag". This bag has 30,000 balls in it, some marked with a "-1", some with "0", and some with "+1". Whichever ball you draw, you add its number to your current "score" of -1. E.g., if you draw a -1 your new score is -2.

So long as your new score is below zero, you draw again from the first bag and add that ball's number to your score as well. But if and when your score gets to zero or higher, you then draw from a Second Bag, which has a different composition of -1s, 0s, and +1s.

I want you to draw a total of 1000 ping pong balls from the appropriate bags (i.e., depending on whether your current score is below zero or not) and then write down your total (cumulative) score at the end of the "set". Then I want you to repeat this experiment a million times and tell me what percentage of the sets you ended up with a score higher than zero.

Is there a faster/more efficient way to code this? It's difficult to vectorize the loop since the draws are not independent, though maybe I could use some combination of ifelse and filter? I suspect the replication is the expensive part, though.

ptm <- proc.time()

###First bag
n=30000
s=155
f=255
z=n-s-f
first_bag=c(rep(0,z), rep(1,s), rep(-1,f))

###Second bag
n2=30000
s2=275
f2=285
z2=n2-s2-f2
second_bag=c(rep(0,z2), rep(1,s2), rep(-1,f2))

###Simulate draws
sim_draw=function(draws){

  score=-1

  for (i in 1:draws) {
    if (score < 0) {
      score = score + sample(first_bag, 1, replace=TRUE)
    } else {
      score = score + sample(second_bag, 1, replace=TRUE)
    }
  }
  score
}

###Repeat sims and find area above zero
samp_distribution=replicate(1000000, sim_draw(1000))
mean(samp_distribution>0)


print(proc.time() - ptm)

There is 1 answer.

Accepted answer, by flodel:

A few ideas:

  1. R is really not good at this kind of iterative process. A compiled language will perform much better. I'll assume here that you want to stick with basic R.
  2. Learn to use the profiler to see where your implementation is wasting time. See the example at the bottom of ?summaryRprof for how to use it; just replace example(glm) with your code, i.e. samp_distribution <- replicate(1000, sim_draw(1000)) (a minimal profiling sketch also follows the benchmark below). You will notice that a lot of time is wasted calling sample over and over, so the first improvement to your code is to call sample only a couple of times:

    sim_draw_1 <- function(draws, bag1 = first_bag, bag2 = second_bag){
    
       # pre-draw all the balls from both bags up front, then walk through them
       s1 <- sample(bag1, draws, replace = TRUE)
       s2 <- sample(bag2, draws, replace = TRUE)
       score <- -1
    
       for (i in 1:draws)
          score <- score + if (score < 0) s1[i] else s2[i]
    
       score
    }
    

Notice that this is nearly ten times faster (I find the microbenchmark package a more reliable way to measure and compare computation times):

    library(microbenchmark)
    microbenchmark(sim_draw(1000), sim_draw_1(1000),
                   times = 1000)
    # Unit: microseconds
    #              expr      min       lq    median       uq       max neval
    #    sim_draw(1000) 5518.758 5845.465 6036.1375 6340.662 53023.483  1000
    #  sim_draw_1(1000)  690.796  730.292  743.8435  785.071  8248.163  1000
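
For reference, the profiling run suggested in point 2 could look like the minimal sketch below. This is my addition rather than part of the original benchmark; the output file name is arbitrary, and it assumes the question's sim_draw and the two bags are already defined.

    Rprof("sim_draw.out")                 # start profiling to a file
    samp_distribution <- replicate(1000, sim_draw(1000))
    Rprof(NULL)                           # stop profiling
    summaryRprof("sim_draw.out")$by.self  # sample() dominates the self time
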
  3. For very iterative code like yours, it is always worth trying the compiler:

    library(compiler)
    sim_draw_2 <- cmpfun(sim_draw_1)
    library(microbenchmark)
    microbenchmark(sim_draw_1(1000), sim_draw_2(1000), times = 1000)
    # Unit: microseconds
    #              expr     min       lq   median      uq      max neval
    #  sim_draw_1(1000) 684.687 717.6640 748.3305 812.971 9412.936  1000
    #  sim_draw_2(1000) 242.895 259.8125 268.3925 294.343 1710.290  1000
    

Another 3x improvement, not bad.

  4. Last, since the biggest remaining bottleneck inside the function is the for loop, you can try to rewrite it so that instead of processing one outcome at a time, you use vectorized (fast) functions to process as many outcomes as possible at once (exactly as many as it takes before you have to switch bags):

    sim_draw_3 <- function(draws, bag1 = first_bag,
                           bag2 = second_bag){
    
       s1 <- sample(bag1, draws, replace = TRUE)
       s2 <- sample(bag2, draws, replace = TRUE)
    
       score <- -1L
       idx   <- 1L
       while (idx <= draws) {
          bag         <- if (score < 0) s1 else s2
          switch.at   <- if (score < 0) 0L else -1L
          next.draws  <- bag[idx:draws]
          next.scores <- score + cumsum(next.draws)
          stop.idx    <- which(next.scores == switch.at)[1]
          if (is.na(stop.idx)) stop.idx <- length(next.draws)
          score <- next.scores[stop.idx]
          idx   <- idx + stop.idx
       } 
       score
    }
    sim_draw_4 <- cmpfun(sim_draw_3)
    
    microbenchmark(sim_draw_2(1000), sim_draw_3(1000), sim_draw_4(1000), times = 1000)
    # Unit: microseconds
    #              expr     min      lq   median       uq      max neval
    #  sim_draw_2(1000) 236.916 252.540 269.1355 293.7775 7819.841  1000
    #  sim_draw_3(1000)  80.527  95.185 128.9840 162.7790  625.986  1000
    #  sim_draw_4(1000)  79.486  92.378 123.5535 162.5085  518.594  1000
    

Another 2x improvement. Here you see that the compiler only gets us a marginal improvement because the number of iterations has dropped dramatically, and everything else in our R code is using very efficient (vectorized) functions.

So we went from 5845 microseconds per function call down to about 124, a pretty good improvement. If this is still too slow, you will probably have to switch to C++ (via Rcpp, for example; a rough sketch follows below). At least I hope this helped show you some useful tricks.
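
For illustration, a minimal Rcpp sketch of that idea could look like the following. This is my own sketch, not something the timings above cover: only the sequential walk is moved to C++, the balls are still pre-drawn in R as in sim_draw_1, and walk_score / sim_draw_cpp are just names I made up.

    library(Rcpp)
    
    cppFunction('
    double walk_score(NumericVector s1, NumericVector s2) {
      double score = -1;
      int n = s1.size();
      for (int i = 0; i < n; ++i) {
        // take the first-bag draw while the score is negative,
        // otherwise the second-bag draw
        score += (score < 0) ? s1[i] : s2[i];
      }
      return score;
    }')
    
    sim_draw_cpp <- function(draws, bag1 = first_bag, bag2 = second_bag) {
      walk_score(sample(bag1, draws, replace = TRUE),
                 sample(bag2, draws, replace = TRUE))
    }
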

Last but not least, I would mention that since your function calls are all independent, you could look into running them in parallel. I'll point you to http://cran.r-project.org/web/views/HighPerformanceComputing.html and encourage you to search around.
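
As a rough sketch of that last idea (again mine, not part of the original answer), the base parallel package can spread the replications over several cores. mclapply forks on Unix-alikes; on Windows you would need to build a cluster and use parLapply instead.

    library(parallel)
    
    n_sets <- 1e6
    scores <- unlist(mclapply(seq_len(n_sets),
                              function(i) sim_draw_4(1000),
                              mc.cores = detectCores()))
    mean(scores > 0)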