Pandas DataFrame.to_parquet() and setting the Zstd compression level


I am writing out a compressed Parquet file from a DataFrame as follows:

result_df.to_parquet("my-data.parquet", compression="zstd")

How can I tell Pandas which compression level to use for the zstd codec?

There are 3 answers

Answer by Guy (best answer):

With the pyarrow engine, you can pass compression_level through the kwargs of to_parquet; pandas forwards any extra keyword arguments to the engine:

result_df.to_parquet(path, engine='pyarrow', compression='zstd', compression_level=1)

Test:

import pandas as pd
import pyarrow.parquet as pq

path = 'my-data.parquet'
result_df = pd.DataFrame({'a': range(100000)})

for i in range(10):
    # write the file at compression level i
    result_df.to_parquet(path, engine='pyarrow', compression='zstd', compression_level=i)

    # compressed size of the first column chunk of row group 0
    metadata = pq.ParquetFile(path).metadata.row_group(0).column(0)
    print(f'compression level {i}: {metadata.total_compressed_size}')

Output:

compression level 0: 346166
compression level 1: 309501
compression level 2: 309500
compression level 3: 346166
compression level 4: 355549
compression level 5: 381823
compression level 6: 310104
compression level 7: 310088
compression level 8: 308866
compression level 9: 308866
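
As a sanity check, the codec actually recorded in the file can be read back from the column-chunk metadata; a small sketch using the same pyarrow API as the test above:

import pyarrow.parquet as pq

meta = pq.ParquetFile('my-data.parquet').metadata
# codec name stored for the first column chunk of the first row group
print(meta.row_group(0).column(0).compression)  # expected: 'ZSTD'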
Answer by Vitalizzare:

Versions: pandas 2.1.0, pyarrow 13.0.0, fastparquet 2023.8.0

How to set compression level in DataFrame.to_parquet

In pandas 2.1.0, two different libraries can serve as engines for writing Parquet files: pyarrow and fastparquet. They specify the compression level in mutually incompatible ways, so in both cases it is better to pass the engine parameter explicitly to avoid confusion.

engine = 'pyarrow'

In the case of pyarrow, the default engine in pandas 2.1.0, we pass the compression_level parameter on to the engine; the pyarrow documentation describes it as follows:

compression_level : int or dict, default None
Specify the compression level for a codec, either on a general basis or per-column. If None is passed, arrow selects the compression level for the compression codec in use. The compression level has a different meaning for each codec, so you have to read the documentation of the codec you are using. An exception is thrown if the compression codec does not allow specifying a compression level.

This parameter reaches the pyarrow Parquet writer as an item of the kwargs dict of the to_parquet method.
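
For context, this is roughly the direct pyarrow call that pandas makes on your behalf; a simplified sketch, not pandas' exact internals, assuming result_df from the question:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(result_df)
# compression_level accepts an int, or a dict for per-column levels,
# e.g. compression_level={'a': 1, 'b': 9}
pq.write_table(table, 'my-data.parquet', compression='zstd', compression_level=2)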

engine = 'fastparquet'

In the case of fastparquet, the compression level is passed as part of the compression parameter, which then takes the following shape:

{
  column_name: {
    'type': desired_compression_for_specified_column, 
    'args': kwargs_for_specified_compression_type
  },
  # ...
  '_default': {    # default compression type for non-specified columns
    'type': 'zstd',
    'args': {'level': desired_compression_level_for_zstandard}
  }
}

Note that there is no compression_level parameter in the fastparquet.write method, so we can't use it as a keyword in pandas DataFrame.to_parquet with fastparquet as the engine. As for the possible keywords in the 'args' dictionary for Zstandard compression, see the details of cramjam.zstd.compress, which fastparquet 2023.8.0 uses under the hood.

Sample code

my_compression_level = 2

# with pyarrow
df.to_parquet(
    'my-data.parquet', 
    engine='pyarrow', 
    compression='zstd', 
    compression_level=my_compression_level
)

# with fastparquet
df.to_parquet(
    'my-data.parquet', 
    engine='fastparquet', 
    compression={
        '_default': {
            'type': 'zstd', 
            'args': {'level': my_compression_level}
        }
    }
)
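
The same dictionary can also mix codecs per column; a sketch assuming hypothetical columns 'a' and 'b':

# gzip for column 'b', zstd level 5 for all other columns
df.to_parquet(
    'my-data.parquet',
    engine='fastparquet',
    compression={
        'b': {'type': 'gzip', 'args': None},
        '_default': {'type': 'zstd', 'args': {'level': 5}},
    },
)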
Answer by Freeman:

Adjusting the zstd compression level lets you balance compression ratio against speed: higher values yield better compression but slower writes. The Zstandard library's default level is 3. With the pyarrow engine, the level is set via the compression_level keyword, which to_parquet forwards to the writer:

result_df.to_parquet("my-data.parquet", engine="pyarrow", compression="zstd", compression_level=10)

Here is a simple example:

import pandas as pd

data = {'A': [1, 2, 3, 4, 5],
        'B': ['foo', 'bar', 'baz', 'qux', 'quux']}
df = pd.DataFrame(data)

# write the DataFrame to Parquet with zstd compression at level 10
df.to_parquet("my-data.parquet", engine="pyarrow", compression="zstd", compression_level=10)
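
To see the trade-off on your own data, you can time a few levels and compare file sizes; an illustrative sketch (it writes a scratch file named tmp.parquet, and the numbers depend entirely on the data):

import os
import time

import pandas as pd

df = pd.DataFrame({'A': range(1_000_000)})

for level in (1, 3, 10):
    start = time.perf_counter()
    df.to_parquet('tmp.parquet', engine='pyarrow', compression='zstd', compression_level=level)
    elapsed = time.perf_counter() - start
    size = os.path.getsize('tmp.parquet')
    print(f'level {level}: {size} bytes in {elapsed:.3f}s')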