Parquet-mr - Enabling dictionary on a column increases the number of row groups

I'm trying to write time series data to Parquet and to get more familiar with Parquet's configuration. The schema of the Parquet file is:

Schema:
message fr.theialand.insitu.parquet.persistance.NumericalObservation {
  optional int64 date_begin;
  required int64 date_end;
  required double latitude;
  required double longitude;
  optional double altitude;
  optional double value;
  optional binary quality_flag (STRING);
  optional group additional_values (LIST) {
    repeated binary array (STRING);
  }
}

I'm using parquet-mr to write the file with the following writer:

try (ParquetWriter<NumericalObservation> writer = AvroParquetWriter
        .<NumericalObservation>builder(HadoopOutputFile.fromPath(path, new Configuration()))
        .withSchema(avroSchema)
        .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
        .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .withRowGroupSize(1024L * 1024) // target row group size: 1 MB
        .build())

This writer creates a Parquet file with many row groups, each much smaller than 1 MB. Here is one example row group as printed by the parquet-cli meta command:

Row group 147:  count: 19783  0.10 B records  start: 152210  total: 1.869 kB
--------------------------------------------------------------------------------
                         type      encodings count     avg size   nulls   min / max
date_begin               INT64     S   D     19783     0.01 B     0       "1497197700000" / "1503132300000"
date_end                 INT64     S   D     19783     0.01 B     0       "1497198000000" / "1503132600000"
latitude                 DOUBLE    S _ R     19783     0.00 B     0       "13.256" / "13.256"
longitude                DOUBLE    S _ R     19783     0.00 B     0       "2.8409" / "2.8409"
altitude                 DOUBLE    S _ R     19783     0.00 B     0       "2.06" / "2.06"
value                    DOUBLE    S _ R     19783     0.07 B     0       "-0.0" / "9.81"
quality_flag             BINARY    S _ R     19783     0.00 B     0       "9" / "9"
additional_values.array  BINARY    S   D     19783     0.00 B     19783  
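
As a sanity check on how the header line of this listing can be read, here is a minimal sketch that recomputes the per-record figure from the count and the total. It assumes that "kB" means 1024 bytes and that "total" is the size stored in the file, which is not necessarily the quantity the writer compares against the row group size setting. The class and method names are made up for illustration.

```java
import java.util.Locale;

public class RowGroupSizeCheck {
    // Assumption: "kB" in the parquet-cli output means 1024 bytes.
    static final double BYTES_PER_KB = 1024.0;

    /** Average stored bytes per record for one row group. */
    static double bytesPerRecord(double totalKb, long recordCount) {
        return totalKb * BYTES_PER_KB / recordCount;
    }

    /** Number of records needed to reach targetBytes at the given average size. */
    static long recordsToFill(long targetBytes, double bytesPerRecord) {
        return (long) Math.floor(targetBytes / bytesPerRecord);
    }

    public static void main(String[] args) {
        long target = 1024L * 1024; // value passed to withRowGroupSize
        double perRecord = bytesPerRecord(1.869, 19783); // row group 147
        System.out.println(String.format(Locale.ROOT,
                "row group 147: %.2f B per record", perRecord));
        System.out.println("records needed to store 1 MB at that rate: "
                + recordsToFill(target, perRecord));
    }
}
```

Under these assumptions the result rounds to the 0.10 B shown in the header, and it would take roughly 10.8 million records at that rate to store 1 MB, so the stored size alone does not explain a row group being closed after 19783 records.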

I understand very little of the information returned. It would be great if someone could clarify it, especially the encodings column. Looking at the source code, I believe that the date_begin and date_end columns are encoded using one of the delta encodings described here, and that no dictionary should be computed for these columns.

However, the writer option .withDictionaryEncoding(true) is set by default. If I set .withDictionaryEncoding("date_begin", false) and .withDictionaryEncoding("date_end", false), the generated file has only one row group:

Row group 0:  count: 2945376  0.04 B records  start: 4  total: 105.826 kB
--------------------------------------------------------------------------------
                         type      encodings count     avg size   nulls   min / max
date_begin               INT64     S   D     2945376   0.01 B     0       "631152000000" / "1514764500000"
date_end                 INT64     S   D     2945376   0.01 B     0       "631152300000" / "1514764800000"
latitude                 DOUBLE    S _ R     2945376   0.00 B     0       "13.256" / "13.256"
longitude                DOUBLE    S _ R     2945376   0.00 B     0       "2.8409" / "2.8409"
altitude                 DOUBLE    S _ R     2945376   0.00 B     0       "2.06" / "2.06"
value                    DOUBLE    S _ R     2945376   0.02 B     0       "-9999.9" / "15.22"
quality_flag             BINARY    S _ R     2945376   0.00 B     0       "11" / "9"
additional_values.array  BINARY    S   D     2945376   0.00 B     2945376 

The encodings information looks similar in both cases. Can someone explain what is happening?

  1. I would like to understand what triggers the creation of a new row group in the first case, since all row groups are well under 1 MB.
  2. The second configuration seems to fit my data, since the dates are incremental. But I don't understand why preventing dictionary creation on the date columns has an effect, since both columns seem to be delta encoded already.
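
To illustrate what "incremental" means for the date columns in point 2, here is a minimal sketch that compares the number of distinct values (what a dictionary would have to hold) with the number of distinct successive differences (what a delta encoding works on). It does not reproduce parquet-mr's encoders and does not explain the change in row group count. The 300000 ms step is an assumption, inferred from the gap between the date_begin and date_end minimums in row group 147, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class DeltaVsDictionarySketch {
    /** Evenly spaced timestamps: start, start + step, start + 2 * step, ... */
    static long[] timestamps(long start, long step, int n) {
        long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = start + i * step;
        }
        return values;
    }

    /** Number of entries a dictionary of these values would contain. */
    static int distinctValues(long[] values) {
        Set<Long> seen = new HashSet<>();
        for (long v : values) {
            seen.add(v);
        }
        return seen.size();
    }

    /** Number of different gaps between consecutive values. */
    static int distinctDeltas(long[] values) {
        Set<Long> seen = new HashSet<>();
        for (int i = 1; i < values.length; i++) {
            seen.add(values[i] - values[i - 1]);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Assumed step of 300000 ms; start and count taken from row group 147.
        long[] ts = timestamps(1497197700000L, 300000L, 19783);
        System.out.println("distinct values: " + distinctValues(ts));
        System.out.println("distinct deltas: " + distinctDeltas(ts));
    }
}
```

With evenly spaced timestamps every value is distinct, so a dictionary would need one entry per row, while all successive differences are identical. That is the sense in which the second configuration seems to fit the data.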