As a part of data workflow I need to modify values in a subset of dask dataframe columns and pass the results for further computation. In particular, I'm interested in 2 cases: mapping columns and mapping partitions. What is the recommended safe & performant way to act on the data? I'm running it a distributed setup on a cluster with multiple worker processes on each host.
Case1.
I want to run:
res = dataframe.column.map(func, ...)
this returns a data series so I assume that original dataframe is not modified. Is it safe to assign a column back to the dataframe e.g. dataframe['column']=res
? Probably not. Should I make a copy with .copy() and then assign result to it like:
dataframe2 = dataframe.copy()
dataframe2['column'] = dataframe.column.map(func, ...)
Any other recommended way to do it?
Case2
I need to map partitions of the dataframe:
df.map_partitions(mapping_func, meta=df)
Inside the mapping_func() I want to modify values in chosen columns, either by using partition[column].map
or simply by creating a list comprehension. Again, how do modify the partition safely and return it from the mapping function?
Partition received by mapping function is a Pandas dataframe (copy of original data?) but while modifying data in-place I'm seeing some crashes (no exception/error messages though). Same goes for calling partition.copy(deep=False)
, it doesn't work. Should partition be deep copied and then modified in-place? Or should I always construct a new dataframe out of new/mapped column data and original/unmodified series/columns?
You can safely modify a dask.dataframe
Operations like the following are supported and safe
This modifies the task graph in place but does not modify the data in place (assuming that the function
func
creates a new series).You can not safely modify a partition
Your second case when you
map_partitions
a function that modifies a pandas dataframe in place is not safe. Dask expects to be able to reuse data, call functions twice if necessary, etc.. If you have such a function then you should create a copy of the Pandas dataframe first within that function.