PySpark Write Parquet Binary Column with Stats (signed-min-max.enabled)


I found the apache-parquet ticket https://issues.apache.org/jira/browse/PARQUET-686, which is marked as resolved for parquet-mr 1.8.2. The feature I want is the calculated min/max in the parquet metadata for a string (BINARY) column.

Referenced from that ticket is an email https://lists.apache.org/thread.html/%3CCANPCBc2UPm+oZFfP9oT8gPKh_v0_BF0jVEuf=Q3d-5=ugxSFbQ@mail.gmail.com%3E whose example uses Java (against parquet-mr directly) rather than pyspark:

    Configuration conf = new Configuration();
    conf.set("parquet.strings.signed-min-max.enabled", "true");
    Path inputPath = new Path(input);
    FileStatus inputFileStatus =
        inputPath.getFileSystem(conf).getFileStatus(inputPath);
    List<Footer> footers = ParquetFileReader.readFooters(conf, inputFileStatus, false);
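A rough pyspark equivalent of that Java snippet, driven through the py4j gateway, might look like the sketch below. This is my own translation, not from the email: it assumes the parquet-mr classes are visible on the driver classpath, and the part-file name is a placeholder.

jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("parquet.strings.signed-min-max.enabled", "true")

# Read the footer of one concrete part file, mirroring the Java example
# ("part-00000.parquet" is a placeholder name).
input_path = jvm.org.apache.hadoop.fs.Path("s3a://test.bucket/option/part-00000.parquet")
status = input_path.getFileSystem(conf).getFileStatus(input_path)
footers = jvm.org.apache.parquet.hadoop.ParquetFileReader.readFooters(conf, status, False)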

I've been unable to set this value in pyspark (perhaps I'm setting it in the wrong place?).


Example dataframe:

import random
import string
from pyspark.sql.types import StringType

# 2000 random 10-character strings drawn from uppercase ASCII letters and digits
r = [u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))
     for _ in range(2000)]

df = spark.createDataFrame(r, StringType())

I've tried a few different ways of setting this option:

df.write.format("parquet").option("parquet.strings.signed-min-max.enabled", "true").save("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", "true").parquet("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", True).parquet("s3a://test.bucket/option")

But all of the saved parquet files are missing the ST/STATS for the BINARY column. Here is example metadata output (dumped with parquet-tools) from one of the parquet files:

creator:     parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"value","type":"string","nullable":true,"metadata":{}}]}

file schema: spark_schema
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
value:       OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:33 TS:515
---------------------------------------------------------------------------------------------------

Also, based on this email chain https://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%[email protected]%3E and the question "Specify Parquet properties pyspark", I tried sneaking the config in through the pyspark private API:

spark.sparkContext._jsc.hadoopConfiguration().setBoolean("parquet.strings.signed-min-max.enabled", True)

So I am still unable to set this conf, parquet.strings.signed-min-max.enabled, in parquet-mr (or it is set, but something else has gone wrong). My questions:

  1. Is it possible to configure parquet-mr from pyspark?
  2. Does pyspark 2.3.x support BINARY column stats?
  3. How do I take advantage of the PARQUET-686 feature to add min/max metadata for string columns in a parquet file?

1 Answer

Zoltan (best answer):

Historically, Parquet writers produced incorrect min/max values for UTF-8 strings, because they compared the underlying bytes as signed values. Newer Parquet implementations therefore skip those stats during reading unless parquet.strings.signed-min-max.enabled is set. So this setting is a read option that tells the Parquet library to trust the min/max values in spite of their known deficiency. The only case in which the setting can safely be enabled is when the strings contain only ASCII characters, because the corresponding bytes will never be negative.
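A minimal pyspark sketch of applying it on the read side (my illustration, not part of the original answer; it assumes the column is known to be ASCII-only):

# Set the flag on the Hadoop configuration *before reading*, not on the writer;
# only safe here because the data was generated from ASCII characters.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.strings.signed-min-max.enabled", "true")

# With the read option set, the Parquet library is allowed to trust the legacy
# min/max stats, e.g. for row-group skipping when a filter is pushed down.
df = spark.read.parquet("s3a://test.bucket/option")
df.filter(df.value > u"M").explain()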

Since you are using parquet-tools to dump the statistics, and parquet-tools itself uses the Parquet library, it ignores the string min/max statistics by default. Although it looks as if there were no min/max values in the file, they are actually there; they just get ignored.
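One way to confirm that the stats are physically present (my suggestion, not part of the original answer) is to inspect the footer with a reader that is not built on parquet-mr, such as pyarrow, which exposes the raw column-chunk statistics:

import pyarrow.parquet as pq

# "part-00000.parquet" is a placeholder for one of the written part files.
meta = pq.ParquetFile("part-00000.parquet").metadata
stats = meta.row_group(0).column(0).statistics
if stats is not None and stats.has_min_max:
    print(stats.min, stats.max)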

The proper solution to this problem is PARQUET-1025, which introduces the new statistics fields min-value and max-value; these are defined to handle UTF-8 strings correctly.