Imagine I hand you a ping pong ball with a "-1" printed on it. I then tell you to draw another ping pong ball from a bag marked "First Bag". This bag has 30,000 balls in it, some marked with a "-1", some with "0", and some with "+1". Whichever ball you draw, you add its number to your current "score" of -1. E.g., if you draw a -1 your new score is -2.
So long as your new score is below zero, you draw again from the first bag and add to your score again. But if and when your score gets to zero or higher, you then draw from a Second Bag, which has a different composition of -1s, 0s, and +1s.
I want you to draw a total of 1000 ping pong balls from the appropriate bags (i.e., depending on whether your current score is below zero or not) and then write down your total (cumulative) score at the end of the "set". Then I want you to repeat this experiment a million times and tell me in what percentage of the sets you ended up with a score higher than zero.
Is there a faster/more efficient way to code this? It's difficult to vectorize the loop since the draws are not independent, though maybe I can use some combo of ifelse and filter? I suspect the replicating is the expensive part, though.
ptm <- proc.time()
###First bag
n=30000
s=155
f=255
z=n-s-f
first_bag=c(rep(0,z), rep(1,s), rep(-1,f))
###Second bag
n2=30000
s2=275
f2=285
z2=n2-s2-f2
second_bag=c(rep(0,z2), rep(1,s2), rep(-1,f2))
###Simulate draws
sim_draw = function(draws) {
  score = -1
  for (i in 1:draws) {
    if (score < 0) {
      score = score + sample(first_bag, 1, replace=TRUE)
    } else {
      score = score + sample(second_bag, 1, replace=TRUE)
    }
  }
  score
}
###Repeat sims and find area above zero
samp_distribution=replicate(1000000, sim_draw(1000))
mean(samp_distribution>0)
print(proc.time() - ptm)
A few ideas:
Learn to use the profiler to see where your implementation is wasting time. See the example at the bottom of ?summaryRprof for how to use it; just replace example(glm) with your code: samp_distribution <- replicate(1000, sim_draw(1000)).
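For instance, something along these lines (a sketch adapted from that help page; tmp is just a temporary file to hold the profiling output):

Rprof(tmp <- tempfile())                               # start profiling into a temp file
samp_distribution <- replicate(1000, sim_draw(1000))   # a smaller run is enough to profile
Rprof(NULL)                                            # stop profiling
summaryRprof(tmp)                                      # see which calls dominate the time
unlink(tmp)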
You will notice that a lot of time is wasted calling sample over and over. So the first improvement to your code can be to call sample only a couple of times per set:
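For example, a sketch along those lines (sim_draw2 is just an illustrative name): pre-draw a whole set's worth of balls from each bag, once, and only index into those vectors inside the loop. Because the sampling is with replacement, this gives the same distribution as drawing one ball at a time.

sim_draw2 <- function(draws) {
  from_first  <- sample(first_bag,  draws, replace = TRUE)   # pre-draw everything we could need
  from_second <- sample(second_bag, draws, replace = TRUE)   # from each bag, once per set
  score <- -1
  for (i in 1:draws) {
    if (score < 0) {
      score <- score + from_first[i]
    } else {
      score <- score + from_second[i]
    }
  }
  score
}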
See that this is nearly ten times faster (I find the microbenchmark package to be a more reliable method for measuring/comparing computation times).
For very iterative code like yours, it is always worth trying the compiler:
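A minimal sketch, using the byte-code compiler from the base compiler package (sim_draw2_cmp is an illustrative name):

library(compiler)                    # byte-code compiler, ships with base R
sim_draw2_cmp <- cmpfun(sim_draw2)   # compiled copy of the function above
# compare them, e.g., with microbenchmark::microbenchmark(sim_draw2(1000), sim_draw2_cmp(1000))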
Another 3x improvement, not bad.
Last, since the biggest remaining bottleneck inside the function is the for loop, you can try to rewrite it so that, instead of processing one outcome at a time, you use vectorized (fast) functions to process as many outcomes as possible at once (the exact number of outcomes that will force you to switch bags).
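One way to write that idea (a sketch; sim_draw_vec is an illustrative name): over-sample the current bag, walk the cumulative score, and stop at the first draw that forces a switch to the other bag. Since the sampling is with replacement, discarding the unused draws is harmless.

sim_draw_vec <- function(draws) {
  score <- -1
  while (draws > 0) {
    bag  <- if (score < 0) first_bag else second_bag              # bag for the current stretch
    path <- score + cumsum(sample(bag, draws, replace = TRUE))    # running score if we stay in this bag
    k    <- if (score < 0) which(path >= 0)[1] else which(path < 0)[1]  # first draw forcing a switch
    if (is.na(k)) k <- draws          # no switch: the rest of the set stays in this bag
    score <- path[k]
    draws <- draws - k
  }
  score
}
sim_draw_vec_cmp <- cmpfun(sim_draw_vec)   # compiler loaded above; only a marginal extra gain here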
Another 2x improvement. Here you see that the compiler only gets us a marginal improvement because the number of iterations has dropped dramatically, and everything else in our R code is using very efficient (vectorized) functions.
So we went from 5845 microseconds to 124 per function call, a pretty good improvement. If this is still too slow, then you will probably have to switch to C++ (via Rcpp, for example). At least I hope this helped show you some useful tricks.
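If you do go that route, a very rough sketch of the idea (assuming the Rcpp package is installed; play_set and sim_draw_cpp are illustrative names) is to keep the sampling in R and push only the scoring loop into C++:

library(Rcpp)
cppFunction('
double play_set(NumericVector from_first, NumericVector from_second) {
  // scoring loop for one set, given pre-drawn balls from each bag
  double score = -1;
  int n = from_first.size();
  for (int i = 0; i < n; ++i)
    score += (score < 0) ? from_first[i] : from_second[i];
  return score;
}')
sim_draw_cpp <- function(draws) {
  play_set(sample(first_bag,  draws, replace = TRUE),
           sample(second_bag, draws, replace = TRUE))
}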
Last but not least, I would mention that since your function calls are all independent, you could look into running them in parallel. I'll point you to http://cran.r-project.org/web/views/HighPerformanceComputing.html and encourage you to search around.
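As one possibility, a rough sketch with the base parallel package (mclapply forks, so this form works on Linux/macOS; on Windows you would set up a cluster and use parLapply instead):

library(parallel)
# for reproducible parallel RNG, set RNGkind("L'Ecuyer-CMRG") and a seed first
samp_distribution <- unlist(mclapply(seq_len(1e6),
                                     function(i) sim_draw_vec(1000),
                                     mc.cores = detectCores()))
mean(samp_distribution > 0)   # proportion of sets ending above zero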