How to compress parquet file with zstandard using pandas


I'm using pandas to convert DataFrames to .parquet files with this command:

df.to_parquet(file_name, engine='pyarrow', compression='gzip')

I need to use Zstandard as the compression algorithm, but the function above seems to accept only gzip, snappy, and brotli. Is there a way to use zstd with this function? If not, how can I do it with another package? I tried the zstandard package, but it seems to accept only bytes-like objects.


There are 4 answers

Levi Sands On

I usually use zstandard as my compression algorithm for my dataframes.

This is the code I use (a bit simplified) to write those parquet files:

import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa

parquetFilename = "test.parquet"

df = pd.DataFrame(
    {
        "num_legs": [2, 4, 8, 0],
        "num_wings": [2, 0, 0, 0],
        "num_specimen_seen": [10, 2, 1, 8],
    },
    index=["falcon", "dog", "spider", "fish"],
)

df = pa.Table.from_pandas(df)
pq.write_table(df, parquetFilename, compression="zstd")

And to read these parquet files:

import pyarrow.parquet as pq

parquetFilename = "test.parquet"

df = pq.read_table(parquetFilename)
df = df.to_pandas()


Finally, a shameless plug for a blog post I wrote about the speed-vs-space trade-off of Zstandard and snappy compression in parquet files using pyarrow. It is relevant to your question and includes some more "real world" code examples of reading and writing parquet files with Zstandard. I will actually be writing a follow-up soon, too; if you're interested, let me know.

John On

You can actually just use

df.to_parquet(file_name, engine='pyarrow', compression='zstd')

Note: Only pyarrow supports Zstandard compression, fastparquet does not.

Reading is even easier, since you don't have to name the compression algorithm:

df = pd.read_parquet(file_name)

As of pandas 1.5.3 this was documented only in the backend, although it has worked since pandas 1.4.0. The missing documentation in the interface has been fixed in the current development version.

Istvan On

It seems it is not supported yet:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html

compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy'
    Name of the compression to use. Use None for no compression.

A. West On

Dependencies: %pip install "pandas[parquet,compression]>=1.4"

Code: df.to_parquet(filepath, compression='zstd')

Documentation

  • Installed by "parquet": pyarrow, the default parquet/feather engine; fastparquet also exists.
  • Installed by "compression": Zstandard is mentioned in the docs only from pandas>=1.4, and in to_parquet from pandas>=2.1.