Safe & performant way to modify dask dataframe


As part of a data workflow I need to modify values in a subset of dask dataframe columns and pass the results on for further computation. In particular, I'm interested in two cases: mapping columns and mapping partitions. What is the recommended safe and performant way to act on the data? I'm running in a distributed setup on a cluster with multiple worker processes on each host.

Case 1

I want to run:

res = dataframe.column.map(func, ...)

this returns a Series, so I assume the original dataframe is not modified. Is it safe to assign the column back to the dataframe, e.g. dataframe['column'] = res? Probably not. Should I instead make a copy with .copy() and then assign the result to it, like:

dataframe2 = dataframe.copy()
dataframe2['column'] = dataframe.column.map(func, ...)

Any other recommended way to do it?

Case 2

I need to map partitions of the dataframe:

df.map_partitions(mapping_func, meta=df)

Inside mapping_func() I want to modify values in chosen columns, either by using partition[column].map or simply with a list comprehension. Again, how do I modify the partition safely and return it from the mapping function?

The partition received by the mapping function is a pandas dataframe (a copy of the original data?), but when I modify its data in place I see occasional crashes (with no exception or error messages). The same goes for calling partition.copy(deep=False); it doesn't help. Should the partition be deep-copied and then modified in place? Or should I always construct a new dataframe out of the new/mapped column data and the original/unmodified series/columns?
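
For concreteness, the two variants I'm considering look roughly like this (some_func and the column name 'column' are placeholders):

import pandas as pd

def some_func(x):
    return x  # placeholder for the real per-element transformation

# Variant A: deep-copy the partition, then modify the copy in place
def mapping_func_copy(partition: pd.DataFrame) -> pd.DataFrame:
    partition = partition.copy()
    partition['column'] = partition['column'].map(some_func)
    return partition

# Variant B: build a fresh dataframe from the mapped column plus the
# untouched columns, leaving the received partition alone
def mapping_func_rebuild(partition: pd.DataFrame) -> pd.DataFrame:
    return partition.assign(column=partition['column'].map(some_func))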

1 Answer

MRocklin (accepted answer):

You can safely modify a dask.dataframe

Operations like the following are supported and safe:

df['col'] = df['col'].map(func)

This modifies the task graph in place but does not modify the data in place (assuming that the function func creates a new series).
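
For instance, a minimal runnable sketch of this pattern (the data and the lambda here are made up for illustration):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'col': [1, 2, 3, 4], 'other': ['a', 'b', 'c', 'd']})
df = dd.from_pandas(pdf, npartitions=2)

# The lambda returns new values element-wise, so the underlying pandas
# data is never mutated; the assignment only rewrites the task graph.
df['col'] = df['col'].map(lambda x: x + 1, meta=('col', 'int64'))

print(df.compute())

An equivalent functional form, if you prefer not to assign onto the dataframe, is df = df.assign(col=df['col'].map(...)).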

You cannot safely modify a partition

Your second case, where you map_partitions a function that modifies a pandas dataframe in place, is not safe. Dask expects to be able to reuse data, call functions twice if necessary, etc. If you have such a function then you should first create a copy of the pandas dataframe within that function.
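
A sketch of that copy-first pattern, again with made-up data and a stand-in transformation:

import pandas as pd
import dask.dataframe as dd

def func(x):
    return x * 2  # stand-in for the real transformation

def safe_mapping_func(partition):
    # Copy first: dask may hold a reference to the original partition
    # and reuse it, so the received dataframe must not be mutated.
    partition = partition.copy()
    partition['col'] = partition['col'].map(func)
    return partition

pdf = pd.DataFrame({'col': [1, 2, 3, 4]})
df = dd.from_pandas(pdf, npartitions=2)

print(df.map_partitions(safe_mapping_func, meta=df).compute())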