I am writing out a compressed Parquet file from a DataFrame as follows:
result_df.to_parquet("my-data.parquet", compression="zstd")
How can I instruct pandas to use a specific compression level for the zstd codec?
On pandas 2.1.0, pyarrow 13.0.0, fastparquet 2023.8.0.
In pandas 2.1.0, we can use two different libraries as engines to write Parquet files: pyarrow and fastparquet. They have different, generally incompatible ways of specifying a compression level, so in both cases it is better to pass the engine parameter explicitly in order to avoid any confusion.
In the case of pyarrow, which is the default option in pandas 2.1.0, we have to pass the compression_level parameter to the engine; it is described in the pyarrow documentation:
compression_level : int or dict, default None
Specify the compression level for a codec, either on a general basis or per-column. If None is passed, arrow selects the compression level for the compression codec in use. The compression level has a different meaning for each codec, so you have to read the documentation of the codec you are using. An exception is thrown if the compression codec does not allow specifying a compression level.
This parameter will be passed to the pyarrow parquet writer as an item of the kwargs dict of the to_parquet method.
In the case of fastparquet, the compression level is passed as part of the compression parameter, which in this case should look like:
{
    column_name: {
        'type': desired_compression_for_specified_column,
        'args': kwargs_for_specified_compression_type
    },
    # ...
    '_default': {  # default compression type for non-specified columns
        'type': 'zstd',
        'args': {'level': desired_compression_level_for_zstandard}
    }
}
Note that there's no compression_level parameter in the fastparquet.write method, so we can't use it as a parameter in pandas DataFrame.to_parquet with fastparquet as the engine. As for the possible keywords in the 'args' dictionary for Zstandard compression, see the details of cramjam.zstd.compress, which is used under the hood by fastparquet version 2023.8.0.
my_compression_level = 2

# with pyarrow
df.to_parquet(
    'my-data.parquet',
    engine='pyarrow',
    compression='zstd',
    compression_level=my_compression_level
)

# with fastparquet
df.to_parquet(
    'my-data.parquet',
    engine='fastparquet',
    compression={
        '_default': {
            'type': 'zstd',
            'args': {'level': my_compression_level}
        }
    }
)
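If you need different compression levels for different columns, here is a minimal sketch of both variants (the file names and the DataFrame are made up for illustration; the per-column dict form of pyarrow's compression_level follows the documentation quoted above, and the fastparquet dict follows the structure described above):

import pandas as pd

df = pd.DataFrame({'a': range(1000), 'b': [str(i) for i in range(1000)]})

# pyarrow: compression_level can be a dict mapping column names to levels
df.to_parquet(
    'per-column-pyarrow.parquet',
    engine='pyarrow',
    compression='zstd',
    compression_level={'a': 1, 'b': 19}
)

# fastparquet: per-column entries plus a '_default' fallback
df.to_parquet(
    'per-column-fastparquet.parquet',
    engine='fastparquet',
    compression={
        'a': {'type': 'zstd', 'args': {'level': 1}},
        '_default': {'type': 'zstd', 'args': {'level': 19}}
    }
)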
I think using the compression_opts parameter in the to_parquet function is preferable, as it allows defining compression options through a dictionary. The compression_level key specifically determines the compression level for zstd encoding, so adjusting its value allows balancing compression ratio and speed, with higher values yielding better compression but slower performance. The default value is 3.
result_df.to_parquet("my-data.parquet", compression="zstd", compression_opts={'compression_level': 10})
Here is a simple example:
import pandas as pd

data = {'A': [1, 2, 3, 4, 5],
        'B': ['foo', 'bar', 'baz', 'qux', 'quux']}
df = pd.DataFrame(data)

# writing DataFrame to Parquet with zstd compression and compression level
df.to_parquet("my-data.parquet", compression="zstd", compression_opts={'compression_level': 10})
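To check which codec actually ended up in the file, one option is to read the Parquet metadata back with pyarrow.parquet (a minimal sketch, assuming the file written above; note that the metadata records the codec name but not the level that was used):

import pyarrow.parquet as pq

meta = pq.ParquetFile("my-data.parquet").metadata
print(meta.row_group(0).column(0).compression)  # e.g. 'ZSTD'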
Using the pyarrow engine you can send compression_level in kwargs to to_parquet:
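A minimal sketch of such a test (the DataFrame and file names are made up for illustration; it writes the same data at two zstd levels and compares the resulting file sizes):

import os
import pandas as pd

df = pd.DataFrame({'x': range(100_000), 'y': [f'row {i}' for i in range(100_000)]})

for level in (1, 19):
    path = f'my-data-zstd-{level}.parquet'
    df.to_parquet(path, engine='pyarrow', compression='zstd', compression_level=level)
    print(level, os.path.getsize(path))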