How to save dataframe partitions one by one to same local database?


I have ~30 GB of uncompressed spatial data containing geometries, an id, and some strings. It is kept as a Dask DataFrame with columns like:

    id | geometry  | ...
    12 | POINT(..) | ...

Because the data is too big to fit on any single worker as one big dataframe, I am considering saving it to a local database partition by partition, on the workers.

Is this the correct way, or is there a better way to do it? By the way, I don't care about the returned data, so I just pass a meta. I assume the compute call is synchronous and will only return after all the save calls have finished.

    def save(self, gdf, layer):
        features = gdf["geometry"].values
        if features.shape[0] > 0:
            ...  # a call to the file database
            return dd.utils.make_meta(features)

    dataframe.map_partitions(self.save, layer=layer, meta=dd.utils.make_meta(features)).compute()
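
For reference, here is a cleaned-up sketch of what I have in mind. save_partition and the "my_layer" name are placeholders for my real write function and layer, dataframe is the Dask DataFrame above, and the meta just describes the tiny per-partition result I return:

    import pandas as pd
    import dask.dataframe as dd

    def save_partition(pdf, layer):
        """Write one pandas partition to the local database (actual call omitted)."""
        if len(pdf) > 0:
            ...  # replace with the real call that writes pdf into `layer`
        # Return a tiny, fixed-schema frame so the meta is easy to describe.
        return pd.DataFrame({"rows_written": [len(pdf)]})

    # meta describes the columns/dtypes of what save_partition returns
    meta = pd.DataFrame({"rows_written": pd.Series([], dtype="int64")})

    # compute() is synchronous: it returns only after every partition has been saved
    dataframe.map_partitions(save_partition, layer="my_layer", meta=meta).compute()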

Thanks


1 answer

Answered by GUOZHAN SUN:

I found the answer on this page: https://docs.dask.org/en/stable/delayed-collections.html

    from dask import delayed
    import dask.dataframe as dd

    # build the dataframe from one delayed load per file
    dfs = [delayed(load)(fn) for fn in filenames]
    df = dd.from_delayed(dfs)
    df = ...  # do work with dask.dataframe
    # turn the result back into one delayed object per partition and write each
    dfs = df.to_delayed()
    writes = [delayed(save)(df, fn) for df, fn in zip(dfs, filenames)]
    dd.compute(*writes)
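
Adapted to the question above, a rough sketch might look like this (reusing the hypothetical save_partition from the question and the original dataframe; dd.compute blocks until every partition write has returned):

    from dask import delayed
    import dask.dataframe as dd

    # one delayed object per partition of the original Dask DataFrame
    parts = dataframe.to_delayed()
    writes = [delayed(save_partition)(part, "my_layer") for part in parts]

    # blocks until all partition writes have finished
    dd.compute(*writes)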