Is it ok for Dask Bags to return Dask Bags?

Question

Is it ok for Dask Bags to return Dask Bags?

74 views Asked by bcollins At 04 October 2020 at 09:07

Is it kosher for dask.bag to return / concat other bags?

Any advantages / pitfalls of interest?

As a simple example:

import dask.bag as bag

bg = (bag.from_sequence(['a', 'b', 'c'])
         .map(lambda letter: bag.from_sequence([f'{letter} - 1', 
               f'{letter} - 2']))
         .fold(lambda x, y: bag.concat((x, y)))
         .compute())
bg.compute()

>>> ['a - 1', 'a - 2', 'b - 1', 'b - 2', 'c - 1', 'c - 2']

Some pitfalls I can think of:

need to call compute(scheduler='synchronous') nested bags...
I called compute in the example above twice which is awkward.

Any thoughts on advantages?

Original Q&A

There are 1 answers

**mdurant** · Accepted Answer · 2020-10-04T18:39:02+00:00

This is definitely an anti-pattern - clearly you would, at the very least, be dealing with partitions on very different size-scales. You would also have a bottleneck on the inner computes. In general, Dask collections are meant to deal with core data types, which for bag would mean you using normal iterator/toolz operations (which the bag API is mostly copied from).

TechQA.

Is it ok for Dask Bags to return Dask Bags?

There are 1 answers

Related Questions in PYTHON

Related Questions in DASK

Related Questions in BAG

Popular Questions

Popular Tags

Trending Questions