I'm trying to process a dataset and make incremental updates while writing it out with Dask. The Dask _metadata file would help a lot when rereading the processed data. However, as I write new partitions/subsets to the same path, the existing metadata gets overwritten to describe only the new partitions/subsets rather than being updated to include them.
import dask.dataframe as dd
df = dd.read_parquet(read_path)
# some transformations
df = …
df.to_parquet(write_path, partition_on=[col1, col2, …], write_metadata_file=True)
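To make the pattern concrete, the repeated writes look roughly like this (the loop, subset_paths, and the column names are placeholders, not my actual code):

import dask.dataframe as dd

for subset_path in subset_paths:  # placeholder: one input subset per run
    df = dd.read_parquet(subset_path)
    # same transformations as above
    df = ...
    df.to_parquet(
        write_path,                     # same output directory every time
        engine="fastparquet",
        partition_on=["col1", "col2"],  # placeholder partition columns
        write_metadata_file=True,       # _metadata ends up describing only this write
    )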
I've looked in a few places and haven't found an obvious way to do this. Has anyone handled a use case like this? Either incrementally updating the metadata file on each write, or editing/combining several metadata files after the fact, would work for me, for example along the lines of the sketch below. Any suggestions are appreciated.
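For the "combine" option, this is roughly what I was imagining, assuming fastparquet's writer.merge can rebuild a common _metadata from the individual part files (I haven't verified how it behaves with partition_on directories, and the glob pattern is just a guess at the on-disk layout):

import glob
from fastparquet import writer

# Collect every data file written so far under the output directory.
part_files = sorted(glob.glob(f"{write_path}/**/*.parquet", recursive=True))

# Rebuild a combined _metadata file in the common root of those files.
writer.merge(part_files)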
This problem is specific to the fastparquet engine (it works fine with pyarrow).
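I also wondered whether passing append=True on the later writes would keep _metadata in sync instead of replacing it, but I haven't confirmed that it actually merges the footers with the fastparquet engine:

# Subsequent incremental writes; whether append=True keeps _metadata
# up to date here is exactly what I'm unsure about.
df.to_parquet(
    write_path,
    engine="fastparquet",
    partition_on=["col1", "col2"],  # placeholder partition columns
    append=True,                    # add new row groups instead of starting over
    write_metadata_file=True,
)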