Is it kosher for dask.bag
to return / concat other bags?
Any advantages / pitfalls of interest?
As a simple example:
import dask.bag as bag
bg = (bag.from_sequence(['a', 'b', 'c'])
.map(lambda letter: bag.from_sequence([f'{letter} - 1',
f'{letter} - 2']))
.fold(lambda x, y: bag.concat((x, y)))
.compute())
bg.compute()
>>> ['a - 1', 'a - 2', 'b - 1', 'b - 2', 'c - 1', 'c - 2']
Some pitfalls I can think of:
- need to call
compute(scheduler='synchronous')
nested bags... - I called compute in the example above twice which is awkward.
Any thoughts on advantages?
This is definitely an anti-pattern - clearly you would, at the very least, be dealing with partitions on very different size-scales. You would also have a bottleneck on the inner computes. In general, Dask collections are meant to deal with core data types, which for bag would mean you using normal iterator/toolz operations (which the bag API is mostly copied from).