When I create a parquet file with an array column using pyarrow, the schema has two extra nodes nested inside the array node: list and item (in more recent versions of pyarrow, item is now element).
First, I'd like to understand why that is. My guess is that it's related to the Dremel encoding used in parquet.
More importantly, how do I recover the original schema from a parquet file? I've encountered parquet files that use different names for the two added nested nodes, so skipping them by name is unreliable. I need this to integrate parquet into a DB I'm working on.
The following code generates a parquet file with an array of structs:
import pyarrow
import pyarrow.parquet as pq

# Build three rows, each holding a list of two structs.
arr = []
for i in range(3):
    arr.append([{"val": 1, "key": 1}, {"val": 1, "key": 1}])
table = pyarrow.table({
    "my_array": arr
})

# Write the table to a parquet file.
file_path = "/data/minio/ne-bucket/repro/data.parquet"
writer = pq.ParquetWriter(file_path, table.schema)
writer.write_table(table)
writer.close()

# Read the file back and print its parquet schema.
parquet_file = pq.ParquetFile(file_path)
schema = parquet_file.schema
print(schema)
Here's the result; note the added list and element nodes:
required group field_id=-1 schema {
  optional group field_id=-1 my_array (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 element {
        optional int64 field_id=-1 key;
        optional int64 field_id=-1 val;
      }
    }
  }
}