Is it ok for Dask Bags to return Dask Bags?

84 views Asked by At

Is it kosher for dask.bag to return / concat other bags?

Any advantages / pitfalls of interest?

As a simple example:

import dask.bag as bag

bg = (bag.from_sequence(['a', 'b', 'c'])
         .map(lambda letter: bag.from_sequence([f'{letter} - 1', 
               f'{letter} - 2']))
         .fold(lambda x, y: bag.concat((x, y)))
         .compute())
bg.compute()

>>> ['a - 1', 'a - 2', 'b - 1', 'b - 2', 'c - 1', 'c - 2']

Some pitfalls I can think of:

  • need to call compute(scheduler='synchronous') nested bags...
  • I called compute in the example above twice which is awkward.

Any thoughts on advantages?

1

There are 1 answers

0
mdurant On BEST ANSWER

This is definitely an anti-pattern - clearly you would, at the very least, be dealing with partitions on very different size-scales. You would also have a bottleneck on the inner computes. In general, Dask collections are meant to deal with core data types, which for bag would mean you using normal iterator/toolz operations (which the bag API is mostly copied from).