What is the most efficient way to use the dask multiprocessing scheduler when the data flow between tasks is large?


We have a dask compute graph (quite custom, so we use dask delayed instead of the collections). I've read in the docs that the current scheduling policy is LIFO, so a worker process has a good chance of getting the data it has just computed for further steps down the graph. But as far as I understand, task computation results are still (de)serialized to the hard drive even in this case.

So the question is: how much of a performance gain would I get by keeping as few tasks as possible along a single path of independent computations in the graph:

A) many small "map" tasks along each path

t --> t --> t -->...
                     some reduce stage
t --> t --> t -->...

B) one huge "map" task for each path

   T ->
        some reduce stage
   T -> 
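
For concreteness, here is a minimal dask.delayed sketch of the two layouts; the step, big_step, and reduce_stage functions are just placeholders, not our real code:

    import dask

    @dask.delayed
    def step(x):
        # hypothetical small "map" task (case A)
        return x + 1

    @dask.delayed
    def big_step(x):
        # hypothetical single large "map" task (case B)
        for _ in range(3):
            x += 1
        return x

    @dask.delayed
    def reduce_stage(a, b):
        # hypothetical reduce stage shared by both cases
        return a + b

    # Case A: many small tasks along each path
    case_a = reduce_stage(step(step(step(1))), step(step(step(2))))

    # Case B: one big task per path
    case_b = reduce_stage(big_step(1), big_step(2))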

Thank you!

1 Answer

MRocklin (accepted answer):

The dask multiprocessing scheduler will automatically fuse linear chains of tasks into single tasks, so your case A above will automatically become case B.
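As a rough illustration (not taken verbatim from the answer), running such a chained graph on the multiprocessing scheduler could look like the sketch below; the inc and add functions are placeholders, and the scheduler="processes" keyword assumes a reasonably recent Dask version:

    from dask import delayed

    @delayed
    def inc(x):
        # placeholder "map" task
        return x + 1

    @delayed
    def add(a, b):
        # placeholder reduce stage
        return a + b

    # Each inc(inc(inc(...))) chain is linear, so the multiprocessing
    # scheduler fuses it into a single task before execution.
    total = add(inc(inc(inc(1))), inc(inc(inc(2))))
    result = total.compute(scheduler="processes")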

If your workloads are more complex and do require inter-node communication then you might want to try the distributed scheduler on a single computer. It manages data movement between workers more intelligently.

$ pip install dask distributed

>>> from dask.distributed import Client
>>> c = Client()  # Starts local "cluster".  Becomes the global scheduler
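
Once the Client exists it becomes the default scheduler, so subsequent .compute() calls on delayed objects run on the local cluster; a minimal sketch:

>>> import dask
>>> total = dask.delayed(sum)([1, 2, 3])
>>> total.compute()  # executed on the local cluster through the Client
6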

Further reading

Correction

Also, just as a note, Dask doesn't persist intermediate results to disk. Rather, it communicates intermediate results directly between processes.