Is there a way to incrementally update Dask metadata file?


I'm trying to process a dataset and make incremental updates as writing it out in Dask. The Dask metadata file would help a lot when it comes to rereading the processed data. However, as I write new partitions/subsets to the same path, the metadata there gets overwritten by the new partitions/subsets rather than updated with them included.

import dask.dataframe as dd

df = dd.read_parquet(read_path)
# some transformations
df = …
df.to_parquet(write_path, partition_on=[col1, col2, …], write_metadata_file=True)

I've looked in a few places and haven't found an obvious way to do this. Has anyone handled such a use case, either by incrementally updating the metadata file or by editing/combining several of them? Any suggestions would be appreciated.


There are 2 answers

Shi Fan (best answer)

This problem is specific to the fastparquet engine; with pyarrow the metadata file is written correctly.

Krishan

Dask's to_parquet() method has an append mode which I think is exactly what you want here:

append : bool, optional

    If False (default), construct data-set from scratch.
    If True, add new row-group(s) to an existing data-set.
    In the latter case, the data-set must exist, and the schema must match the input data.

I have used this successfully with the pyarrow engine, version 1.0.1.