Can I set different degrees of parallelism for different targets with the R {targets} package?


I'm testing out the targets package and am running into a problem with customizing parallelization. My workflow has two steps, and I'd like to parallelize the first step over 4 workers and the second step over 16 workers.

I want to know if I can solve the problem by calling tar_make_future(), and then specifying how many workers each step requires in the tar_target calls. I've got a simple example below, where I'd like the data step to execute with 1 worker, and the sums step to execute with 3 workers.

library(targets)

tar_dir({ # tar_dir() runs this code in a fresh temporary directory
  tar_script({
    library(future)
    library(future.callr)
    library(dplyr)

    plan(callr) # run each target in a fresh background callr process

    list(
      # Goal: this step should execute with 1 worker
      tar_target(
        data,
        data.frame(
          x = seq_len(6),
          id = rep(letters[seq_len(3)], each = 2)
        ) %>%
          group_by(id) %>%
          tar_group(),
        iteration = "group"
      ),
      # Goal: this step should execute with 3 workers, in parallel
      tar_target(
        sums,
        sum(data$x),
        pattern = map(data),
        iteration = "vector"
      )
    )
  })
  tar_make_future()
})

I know that one option is to configure the parallel backend separately within each step, and then call tar_make() to execute the workflow serially. I'm curious about whether I can get this kind of result with tar_make_future().
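
To illustrate what I mean, the per-step version of the sums target might look roughly like the sketch below. This is only illustrative: the furrr call and the 3-worker plan are my stand-ins, not part of the pipeline above, and it trades dynamic branching over data for parallelism inside a single target.

# Sketch of per-step parallelism, run serially with tar_make().
# Assumes furrr is installed; the plan() set here applies only
# while this target's command runs.
tar_target(
  sums,
  {
    future::plan(future.callr::callr, workers = 3)
    furrr::future_map_dbl(split(data$x, data$id), sum)
  }
)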


There are 2 answers

Answer from landau (best answer)

I would recommend calling tar_make_future(workers = <max_parallel_workers>) and letting targets decide how many workers to run in parallel. targets automatically works out which targets can run concurrently and which need to wait for upstream dependencies to finish. In your case, some data branches may finish before others, and the corresponding sums branches can start right away. In other words, some sums branches will begin running before other data branches have even finished, and you can trust targets to scale up transient workers as the need arises. The animation at https://books.ropensci.org/targets/hpc.html#future may help visualize this. If you were to micromanage the parallelism for data and sums separately, you would likely have to wait for all of data to finish before any of sums could start, which could take a long time.
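
For the example above, that single call might look like the following. The worker count is just an illustrative upper bound; targets only launches transient workers when targets are actually ready to run.

tar_make_future(workers = 16) # upper bound; branches are scheduled as they become ready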

Answer from adviksh

A solution that worked in my case was to call tar_make_future() twice. In the example above, that would be:

tar_make_future(data, workers = 1) # build only the data target
tar_make_future(workers = 3)       # then build everything downstream

Though in my actual workflow it looks more like:

tar_make_future(data, workers = 4)
tar_make_future(workers = <max_parallel_workers>)

@landau raises a good point that this approach completely builds the data target before any downstream targets can start. There are certainly workflows where the cleaner and more effective solution is to call tar_make_future(workers = <max_parallel_workers>) and accept the resulting runtime.

In my case, waiting for data to finish wasn't an issue: my data target contained many branches that built quickly, the subsequent targets were much slower to build, and I could parallelize the slow step over many more workers than the fast step (16+ workers for the slow step vs. just 4 for the fast step). If your workflow doesn't share those properties, @landau's suggestion may be the better solution.