parquet-tools cannot read zstd files but can read gzip?


I installed the latest version of parquet-tools from Apache parquet-mr (parquet-tools-1.8.2.jar).

Here is a reproducible example:

>>> import boto3
>>> client = GET_CLIENT() # redacted
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3]], columns=["a","b","c"])
>>> df
   a  b  c
0  1  2  3
>>> from io import BytesIO
>>> filebuf = BytesIO()
>>> df.to_parquet(filebuf, compression="zstd") # Change this to gzip and it works!
>>> client.put_object(Bucket="foo", Key="bar/example.zstd.parquet", Body=filebuf.getvalue())

I downloaded the Parquet file with aws s3 cp and ran parquet-tools head on it, but got:

$ parquet-tools head example.zstd.parquet
Could not read footer: java.lang.NullPointerException

However, running the same command on a gzip-compressed file gives me:

$ parquet-tools head example.gzip.parquet
a = 1
b = 2
c = 3

Is this a bug in zstd compression or in parquet-tools? Or did I miss the fine print somewhere?

NOTE: My parquet-tools is aliased to java -jar .../parquet-tools-1.8.2.jar
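
To rule out a truncated or corrupted download, the file's framing can be checked without parquet-tools. This is only a minimal sketch, assuming the standard Parquet layout: the 4-byte magic PAR1 at both ends of the file, with a 4-byte little-endian footer length just before the trailing magic. The helper name footer_length is hypothetical and not part of any library.

```python
import struct

MAGIC = b"PAR1"  # Parquet files start and end with these four bytes


def footer_length(data: bytes) -> int:
    """Return the footer (metadata) length recorded at the end of a Parquet file.

    Raises ValueError if the leading or trailing magic is missing, or if
    the recorded length cannot fit in the file.
    """
    # Smallest possible file: leading magic + footer length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: missing PAR1 magic")
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (length,) = struct.unpack("<I", data[-8:-4])
    if length > len(data) - 12:
        raise ValueError("footer length exceeds file size")
    return length


# Usage, assuming the file was downloaded to the current directory:
# with open("example.zstd.parquet", "rb") as f:
#     print(footer_length(f.read()))
```

If this returns a plausible length for both the zstd and the gzip file, the bytes on disk are at least well-framed, which would point at how the reader handles the footer contents rather than at the upload or download step.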
