I have hundreds of parquet files, and I want to get each column name and its associated data type into a list in Python. I know I can get the schema, which comes in this format:
COL_1: string
-- field metadata --
PARQUET:field_id: '34'
COL_2: int32
-- field metadata --
PARQUET:field_id: '35'
I just want:
COL_1 string
COL_2 int32
In order to go from Parquet to Arrow (and vice versa), some metadata is added to the schema under the `PARQUET` key.
You can remove the metadata easily:
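A minimal sketch of one way to get there, assuming pyarrow is installed. The helper names `column_types` and `read_column_types` and the example path are made up for illustration. Instead of stripping the metadata from the schema, this reads only each field's name and type, so the field metadata never appears in the result:

```python
def column_types(schema):
    """Return [(name, type_string), ...] for every field in a schema.

    Works on any iterable of objects with .name and .type attributes,
    which is what iterating over a pyarrow.Schema yields.
    """
    return [(field.name, str(field.type)) for field in schema]


def read_column_types(path):
    """Read only the schema (no row data) of one parquet file."""
    # Imported here so that column_types itself does not require pyarrow.
    import pyarrow.parquet as pq

    return column_types(pq.read_schema(path))


# Example with a hypothetical path:
# for name, dtype in read_column_types("data/part-0000.parquet"):
#     print(name, dtype)
```

Because `pq.read_schema` only touches the file footer, looping this over hundreds of files should stay cheap compared with loading the tables.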
This will print:
Bear in mind that if you start writing your own metadata, you'll want to remove only the metadata under the `PARQUET` key.
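If you do need to keep your own field metadata, something along these lines might work. It is a sketch, not a tested recipe: `drop_parquet_keys` and `strip_parquet_metadata` are hypothetical helper names, and it assumes the Parquet entries are the ones whose key starts with `PARQUET:` (as in `PARQUET:field_id` above):

```python
def drop_parquet_keys(metadata):
    """Return a copy of a field's metadata without the PARQUET:* entries.

    pyarrow exposes field metadata as a dict with bytes keys and values
    (or None when there is none); str keys are accepted here as well.
    """
    if not metadata:
        return {}
    kept = {}
    for key, value in metadata.items():
        name = key.decode() if isinstance(key, bytes) else key
        if not name.startswith("PARQUET:"):
            kept[key] = value
    return kept


def strip_parquet_metadata(schema):
    """Rebuild a pyarrow schema, keeping only non-PARQUET field metadata."""
    import pyarrow as pa

    fields = [
        pa.field(
            f.name,
            f.type,
            nullable=f.nullable,
            # Pass None rather than an empty dict when nothing is left.
            metadata=drop_parquet_keys(f.metadata) or None,
        )
        for f in schema
    ]
    # Schema-level metadata is passed through unchanged.
    return pa.schema(fields, metadata=schema.metadata)
```

Keeping the filtering in a plain-dict helper makes it easy to check on its own, separately from the schema rebuild.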