txtProgressBar for parallel bootstrap not displaying properly

852 views Asked by At

Below is a MWE of my problem: I have programmed a progress bar for some function using the bootstrap (via the boot function from the boot package).

This works fine as long as I don't use parallel processing (res_1core below). If I want to use parallel processing by setting parallel = "multicore" and ncpus = 2, the progress bar isn't displayed properly (res_2core below).

library(boot)

rsq <- function(formula, data, R, parallel = c("no", "multicore", "snow"), ncpus = 1) {
  env <- environment()
  counter <- 0
  progbar <- txtProgressBar(min = 0, max = R, style = 3)
  bootfun <- function(formula, data, indices) {
    d <- data[indices,]
    fit <- lm(formula, data = d)
    curVal <- get("counter", envir = env)
    assign("counter", curVal + 1, envir = env)
    setTxtProgressBar(get("progbar", envir = env), curVal + 1)
    return(summary(fit)$r.square)
  }
  res <- boot(data = data, statistic = bootfun, R = R, formula = formula, parallel = parallel, ncpus = ncpus)
  return(res)
}

res_1core <- rsq(mpg ~ wt + disp, data = mtcars, R = 1000)
res_2core <- rsq(mpg ~ wt + disp, data = mtcars, R = 1000, parallel = "multicore", ncpus = 2)

I have read that this is related to the fact that the boot function calls on lapply for single core processing and mclapply for multicore processing. Does anyone know of an easy workaround to deal with this? I mean, I would like to display the progress taking into account all of the parallel processes.

Update

Thanks to the input of Karolis Koncevičius, I have found a workaround (just use the updated rsq function below):

rsq <- function(formula, data, R, parallel = c("no", "multicore", "snow"), ncpus = 1) {
  bootfun <- function(formula, data, indices) {
    d <- data[indices,]
    fit <- lm(formula, data = d)
    return(summary(fit)$r.square)
  }

  env <- environment()
  counter <- 0
  progbar <- txtProgressBar(min = 0, max = R, style = 3)
  flush.console()

  intfun <- function(formula, data, indices) {
    curVal <- get("counter", envir = env) + ncpus
    assign("counter", curVal, envir = env)
    setTxtProgressBar(get("progbar", envir = env), curVal)
    bootfun(formula, data, indices)
  }
  res <- boot(data = data, statistic = intfun, R = R, formula = formula, parallel = parallel, ncpus = ncpus)
  return(res)
}

Unfortunately, this only works for multicore processing when I run R from the terminal. Any ideas how to patch this so it also displays properly in R console or Rstudio?

1

There are 1 answers

4
Karolis Koncevičius On

Not exactly what you ordered, but might be helpful.

A simple statistics function to boot:

library(boot)

bootfun <- function(formula, data, indices) {
    d <- data[indices,]
    fit <- lm(formula, data=d)
    summary(fit)$r.square
}

Higher order function to display progress:

progressReporter <- function(total, nBars=100, f, ...) {
    count <- 1
    step <- ceiling(total/nBars)
    cat(paste(rep("|", nBars), collapse=""), "\r")
    flush.console()
    function(...) {
        if (count %% step==0) {
            cat(".")
        }
        count <<- count + 1
        f(...)
    }
}

Now this function is cheating - it displays progress every "step" of iterations. If you have 1000 iterations, use two cores and print every 10th iteration - it will do the job. The cores don't share state, but they each will run the counter up to 500, and the function will respond to both counters.

On the other hand if you do 1000 iterations, run 10 cores and report every 200 - the function will stay silent, as all the cores will count to 100 each. None will reach 200 - no progress bar. Hope you get the idea. I think it should be ok in most of the cases.

Try it out:

res_1core <- boot(formula="mpg ~ wt + disp", data=mtcars, R=1000, statistic=progressReporter(1000, nBars=100, f=bootfun))
res_2core <- boot(formula="mpg ~ wt + disp", data=mtcars, R=1000, statistic=progressReporter(1000, nBars=100, f=bootfun), parallel="multicore", ncpus=2)