Get original schema from Parquet files

22 views Asked by davdsb At 11 March 2024 at 09:37

When I create a Parquet file containing an array with pyarrow, the schema gains two extra nodes nested inside the array node: list and item (in more recent versions of pyarrow, item is now element).

First, I'd like to understand why that is. My guess is that it's related to the Dremel encoding used in Parquet.

More importantly, how do I get the original schema from a Parquet file? I've encountered Parquet files that use different names for the two added nested nodes, so skipping them by name is problematic. I need this to integrate Parquet into a DB I'm working on.

The following code generates a Parquet file with an array of structs:

import pyarrow
import pyarrow.parquet as pq

# Three rows, each holding an array of two structs
arr = []
for i in range(3):
    arr.append([{"val": 1, "key": 1}, {"val": 1, "key": 1}])

table = pyarrow.table({
    "my_array": arr
})

file_path = "/data/minio/ne-bucket/repro/data.parquet"
# Write the table using its own Arrow schema; the context manager closes the writer
with pq.ParquetWriter(file_path, table.schema) as writer:
    writer.write_table(table)

# Read the file back and print the Parquet-level schema
parquet_file = pq.ParquetFile(file_path)
schema = parquet_file.schema
print(schema)

Here's the result; note the added list and element nodes:

required group field_id=-1 schema {
  optional group field_id=-1 my_array (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 element {
        optional int64 field_id=-1 key;
        optional int64 field_id=-1 val;
      }
    }
  }
}

There are 0 answers
