I have two main questions about the Apache Parquet file format:

  1. The Apache website says that 'The pages share a common header and readers can skip over page they are not interested in'. If so, how can some pages (of the same column in the same row group) be compressed and others not? How can different groups have different statistics? Also, the row group header stores only the offset of the first page of each column, so how do you determine the offsets of the other pages? How can a reader skip a page and move on to the next one without any other information about the remaining pages?

  2. The row group metadata contains metadata for each column. The Apache website also gives the following file structure:

    4-byte magic number "PAR1"

    Column 1 Chunk 1 + Column Metadata

    Column 2 Chunk 1 + Column Metadata

    ...

    Column N Chunk 1 + Column Metadata

    Column 1 Chunk 2 + Column Metadata

    Column 2 Chunk 2 + Column Metadata

    ...

    Column N Chunk 2 + Column Metadata ...

    Column 1 Chunk M + Column Metadata

    Column 2 Chunk M + Column Metadata

    ...

    Column N Chunk M + Column Metadata

    File Metadata

    4-byte length in bytes of file metadata

    4-byte magic number "PAR1"
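
To make the trailer part of this layout concrete, here is a minimal sketch of how I understand a reader would locate the file metadata from the last 8 bytes. It assumes the 4-byte length is a little-endian unsigned integer; `locate_file_metadata` and the toy byte string are my own illustration, not part of any Parquet library, and the metadata itself (which is Thrift-encoded in a real file) is not decoded here.

```python
import struct

MAGIC = b"PAR1"

def locate_file_metadata(data: bytes) -> tuple:
    """Return (start, length) of the serialized file metadata inside `data`.

    Hypothetical helper: it only inspects the leading magic number and the
    8-byte trailer (4-byte length + 4-byte magic number).
    """
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # Assumption: the metadata length is stored as a little-endian uint32.
    (length,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - length
    if start < 4:
        raise ValueError("metadata length is larger than the file")
    return start, length

# Toy stand-in for a file: the payloads are placeholders, not real Parquet data.
fake = MAGIC + b"column-chunks" + b"file-metadata" + struct.pack("<I", 13) + MAGIC
start, length = locate_file_metadata(fake)
print(fake[start:start + length])  # b'file-metadata'
```

The point of the sketch is only that a reader starts from the end of the file; what I still do not understand is what the metadata contains about individual pages.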

What is the difference between the column metadata in the row group metadata and the column metadata shown in the layout above? From the layout, it seems that each column chunk is followed by its own metadata. Is my understanding correct? If so, is this metadata a duplicate of the one in the row group header, or is it the page metadata?

Could someone post a Parquet file, with its data and exact format, in plain text (not binary) so that I can see exactly how the file is laid out? Thanks a lot.
