targets stalls with tar_make_clustermq()


I have a long-ish targets pipeline (takes more than an hour to execute) for which parallel execution is possible. Specifically, many, but not all, calculations can be done in parallel across 155 countries and 60 years. There are times when the country-specific calculations are aggregated from continents to the world, for example, and that sum is not amenable to parallel execution. I am running the pipeline on my local machine only (not on a cluster or networked computer).

When I run the pipeline with 5 countries and tar_make_clustermq(workers = 8) (on 10-core machines: a 14-inch Apple silicon MacBook Pro and an iMac Pro), the pipeline succeeds, and I see 5 processors in use simultaneously. However, when I run the pipeline with 6 or more countries, there are several points at which the pipeline stalls or seemingly switches to single-threaded execution. I have found that I need to restart the pipeline with tar_make_clustermq(workers = 8) or (worse) restart with tar_make() (single-threaded) to get it going again.

The points in the pipeline where a restart is required are 100% repeatable for a specific set of countries, but they change when the set of countries in the analysis changes.

It would be pretty difficult to develop a reprex for this behavior, because of the large files and pipelines involved. So at this time, I am requesting suggestions for next steps for debugging or changing course altogether. Here are some specific questions:

  • I have searched and found only this report (https://github.com/ropensci/targets/issues/182). Have I missed other reports of similar behavior?
  • If others found unreliable behavior from targets and clustermq on the local machine, what hints can you provide for getting around these problems?
  • I have considered switching from clustermq to future in targets. I'm wondering if that switch would provide improvements. I have not tried future. So if someone has experience with both, I welcome your input.

Thanks in advance for any hints!

There is 1 answer

Matthew Kuperus Heun (best answer)

I switched to using future with much success. I'm using future::plan(future.callr::callr). My pipeline no longer hangs/stalls, as it did when using clustermq. Rather it completes without intervention, as desired. For this pipeline, at least, future is the way to go!
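
For anyone wanting to try the same switch, here is a minimal sketch of what a _targets.R file using the future backend might look like. Only future::plan(future.callr::callr) comes from my actual setup; the country codes, the calc_country() function, and the target names are hypothetical stand-ins for the real per-country calculations and the aggregation step.

```r
# _targets.R (sketch; country list, calc_country(), and target names are hypothetical)
library(targets)

# Run each worker in a fresh local R session via callr.
future::plan(future.callr::callr)

# Hypothetical stand-ins for the real inputs and per-country calculation.
countries <- c("USA", "GBR", "ZAF", "CHN", "BRA", "IND")
calc_country <- function(country) {
  data.frame(country = country, value = nchar(country))
}

list(
  tar_target(country, countries),
  # Dynamic branching: one branch per country, eligible to run in parallel.
  tar_target(country_result, calc_country(country), pattern = map(country)),
  # The aggregation depends on every branch, so it runs once after they finish.
  tar_target(world_total, sum(country_result$value))
)
```

With a file like this in place, the pipeline would be started with targets::tar_make_future(workers = 8) instead of tar_make_clustermq(workers = 8). The worker count of 8 simply mirrors the value from the question; it is not a tuned recommendation.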