I installed the latest version of parquet-tools from apache-mr (parquet-tools-1.8.2.jar).
Here is a reproducible example:
>>> import boto3
>>> client = GET_CLIENT() # redacted
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3]], columns=["a","b","c"])
>>> df
   a  b  c
0  1  2  3
>>> from io import BytesIO
>>> filebuf = BytesIO()
>>> df.to_parquet(filebuf, compression="zstd") # Change this to gzip and it works!
>>> client.put_object(Bucket="foo", Key="bar/example.zstd.parquet", Body=filebuf.getvalue())
I aws s3 cp'd the parquet file and tried to run parquet-tools head on it, but got:
$ parquet-tools head example.zstd.parquet
Could not read footer: java.lang.NullPointerException
However, running the same command on a gzip-compressed file gives me:
$ parquet-tools head example.gzip.parquet
a = 1
b = 2
c = 3
Is this a bug in zstd compression or in parquet-tools? Or did I miss some fine print somewhere?
NOTE: My parquet-tools is aliased to java -jar .../parquet-tools-1.8.2.jar
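Since the error mentions the footer, one way to rule out a truncated or corrupted download is to check the Parquet framing by hand. A Parquet file starts and ends with the 4-byte magic PAR1, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The sketch below uses only the standard library; the function name parquet_footer_length is made up for illustration, and it only validates the framing. It does not decode the footer itself or touch the zstd-compressed pages, so passing this check says nothing about codec support in the reader.

```python
import struct

MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


def parquet_footer_length(data: bytes) -> int:
    """Return the footer metadata length in bytes.

    Hypothetical helper for illustration: raises ValueError if the
    leading/trailing magic or the recorded footer length looks wrong.
    """
    # Minimum size: leading magic (4) + footer length (4) + trailing magic (4)
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # Footer length is a 4-byte little-endian unsigned int before the trailing magic
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len
```

It can be run against either the in-memory buffer from the example (filebuf.getvalue()) or the bytes of the downloaded example.zstd.parquet. If both return the same length without raising, the file survived the upload and download intact, and the problem is more likely on the reading side.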