Is there a way to incrementally update Dask metadata file?


I'm trying to process a dataset and make incremental updates as writing it out in Dask. The Dask metadata file would help a lot when it comes to rereading the processed data. However, as I write new partitions/subsets to the same path, the metadata there gets overwritten by the new partitions/subsets rather than updated with them included.

import dask.dataframe as dd

df = dd.read_parquet(read_path)
# some transformations
df = …
df.to_parquet(write_path, partition_on=[col1, col2, …], write_metadata_file=True)

I've looked in a few places and haven't found an obvious way to do this. Has anyone handled such a use case, either by incrementally updating the metadata file or by editing/combining several of them? Any suggestions would be appreciated.


There are 2 answers

Shi Fan (best answer)

This problem is specific to the fastparquet engine; with pyarrow the metadata file is written correctly.

Krishan

Dask's to_parquet() method has an append mode which I think is exactly what you want here:

append : bool, optional

    If False (default), construct data-set from scratch.
    If True, add new row-group(s) to an existing data-set.
    In the latter case, the data-set must exist, and the schema must match the input data.

I have used this successfully with the pyarrow engine, version 1.0.1.