I have ~30 GB of uncompressed spatial data (geometries, an id, and some string columns) kept as a Dask DataFrame with columns like:

    id | geometry  | ...
    12 | POINT(..) | ...

The data is too big to fit on any single worker as one DataFrame, so I'm considering saving each partition into a local database on the worker that holds it.

Is that a reasonable approach, or is there a better way? I don't care about the returned data, so I just pass a meta. I assume compute() is synchronous, i.e. it returns only after all the save calls have finished.
    def save(self, gdf, layer):
        features = gdf["geometry"].values
        if features.shape[0] > 0:
            ...  # a call to the file database
        # The result is discarded, so return an empty slice that matches the meta below.
        return gdf["geometry"].iloc[:0]

    # `features` does not exist at this scope, so build the meta from the geometry column instead.
    meta = dd.utils.make_meta(dataframe["geometry"])
    dataframe.map_partitions(self.save, layer=layer, meta=meta).compute()
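
Since the saved rows themselves never need to come back, one variant is to return a tiny per-partition summary instead, which also makes the meta trivial to specify. A minimal sketch, reusing `dataframe` and `layer` from the snippet above; `write_to_local_db` is a placeholder name for the actual file-database call:

    import pandas as pd

    def save_partition(gdf, layer):
        n = len(gdf)
        if n > 0:
            write_to_local_db(gdf, layer)  # placeholder for the real save call
        # One-row summary per partition; its shape matches the meta passed below.
        return pd.DataFrame({"rows_written": [n]})

    meta = pd.DataFrame({"rows_written": pd.Series([], dtype="int64")})
    counts = dataframe.map_partitions(save_partition, layer=layer, meta=meta).compute()
    print(counts["rows_written"].sum())

compute() here is blocking, so by the time the sum prints, every partition has been written.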
Thanks
I found the answer on this page: https://docs.dask.org/en/stable/delayed-collections.html
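
For anyone landing here later: that page describes turning the DataFrame into a list of delayed partitions and wrapping the save call with dask.delayed. A minimal sketch of that pattern, assuming `dataframe`, `layer`, and a per-partition writer like the `save_partition` above:

    import dask

    parts = dataframe.to_delayed()  # one Delayed object per partition (a pandas/GeoPandas frame)
    tasks = [dask.delayed(save_partition)(part, layer) for part in parts]
    dask.compute(*tasks)            # blocks until every partition has been saved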