How to Handle Growing _metadata File Size and Avoid Corruption in Amazon Redshift Spectrum Parquet Append


Context: Our web application generates a large number of log files that land in an S3 bucket. The files contain JSON strings and have a .txt extension. We process them in chunks of 200 files per loop iteration, which involves several steps: reading the files, extracting data, saving to Amazon Redshift, and saving the initial DataFrame in Parquet format to another S3 bucket under the prefix "spectrum_Data/api_logs/".

Steps:

  1. Reading 200 files from the S3 bucket into a Python Pandas DataFrame (a condensed sketch of this step appears after the list).

  2. Extracting the required data from the initial DataFrame and saving it to an Amazon Redshift table.

  3. Saving the initial DataFrame in Apache Parquet format with partitioning by the date column using the provided code:

    dataframe.to_parquet(
        path=s3_spectrum_loc,
        engine='fastparquet',
        index=False,
        partition_cols=['log_date'],
        times='int96',
        append=True,
    )

  4. We use the Parquet format to enable querying in Amazon Redshift Spectrum's external table linked to the S3 bucket.
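For reference, a condensed sketch of step 1 (reading a batch of log files into a DataFrame) might look like the following; the bucket name, the key listing, and the one-JSON-object-per-file layout are assumptions, so adapt them to how the log files are actually structured:

    import json

    import boto3
    import pandas as pd

    s3 = boto3.client('s3')

    def read_batch(bucket, keys):
        """Read a batch of .txt files, each holding a JSON string, into a DataFrame."""
        records = []
        for key in keys:
            body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
            records.append(json.loads(body))
        return pd.DataFrame(records)

    # 'keys' would come from paginated list_objects_v2 calls, 200 keys per batch:
    # dataframe = read_batch('my-log-bucket', keys)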

Problem: The issue arises in step 3, when saving the data in Parquet format. The "dataframe.to_parquet" call generates two additional files in the S3 location, "_common_metadata" and "_metadata". The "_metadata" file keeps growing, and once it exceeds 70 MB it becomes corrupted. The process then gets stuck in a loop trying to read the corrupted file, which prevents any further data from being appended.

Question: Is there a solution to handle the error or limit the size of the "_metadata" file when appending data using the "dataframe.to_parquet" function? Alternatively, is there a way to skip generating the "_metadata" file while still using the append method, ensuring that existing data is not replaced?

Any insights, code examples, or recommendations to overcome this issue would be greatly appreciated.

Thanks, Aakash


There is 1 answer

Answered by mdurant

I would be interested to know why or how the file is getting corrupted; it's probably something that can be fixed.

In any case, the _metadata file is not required for reading the data and can be deleted; it merely provides a possible optimisation at read time. Parquet written via Dask, for example, no longer creates this file (for any engine). The _common_metadata file is much smaller and probably not causing you any problem.
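If you want to clear out the corrupted _metadata file, a minimal sketch with boto3 could look like this (the bucket name is a placeholder; "spectrum_Data/api_logs/" is the prefix from the question):

    import boto3

    # Delete the (corrupted) _metadata file; as noted above it is only a
    # read-time optimisation, not required for reading the data files.
    s3 = boto3.client('s3')
    s3.delete_object(Bucket='my-spectrum-bucket',
                     Key='spectrum_Data/api_logs/_metadata')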

One simple thing you can try to mitigate the current problem is passing stats=False to to_parquet, which will reduce the size of the metadata somewhat.
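Concretely, since pandas forwards extra keyword arguments to the fastparquet writer (just as times='int96' already is in the original call), that would look like:

    dataframe.to_parquet(
        path=s3_spectrum_loc,
        engine='fastparquet',
        index=False,
        partition_cols=['log_date'],
        times='int96',
        append=True,
        stats=False,  # skip per-column statistics to keep _metadata smaller
    )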

Another simple fix would be to not use append=True, but to write each batch to a separate file in the same directory. That will load fine too, so long as the schema doesn't change.
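A minimal sketch of that approach, keeping the hive-style log_date=... directory layout that the current partition_cols call produces (the uuid-based file name is just one way to avoid collisions between batches):

    import uuid

    # Write one uniquely named file per log_date instead of appending, so no
    # _metadata file is involved and existing files are never rewritten.
    for log_date, part in dataframe.groupby('log_date'):
        key = (f"{s3_spectrum_loc.rstrip('/')}/log_date={log_date}/"
               f"batch-{uuid.uuid4().hex}.parquet")
        part.drop(columns=['log_date']).to_parquet(
            key,
            engine='fastparquet',
            index=False,
            times='int96',
        )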