Extracting column names and data types from a Parquet file with Python

I have hundreds of Parquet files, and I want to get each column name and its associated data type into a list in Python. I know I can get the schema, which comes in this format:

COL_1: string
   -- field metadata --
   PARQUET:field_id: '34'
COL_2: int32
   -- field metadata --
   PARQUET:field_id: '35'

I just want:

COL_1 string
COL_2 int32

There is 1 answer, by 0x26res:

To convert between Parquet and Arrow (in either direction), some metadata is added to the fields of the schema, under keys prefixed with PARQUET: (such as PARQUET:field_id).

You can remove that metadata easily:

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small example table and write it to disk.
table = pa.Table.from_arrays(
    [pa.array([1, 2], type=pa.int32()), pa.array(['foo', 'bar'])],
    schema=pa.schema({'COL1': pa.int32(), 'COL2': pa.string()})
)
pq.write_table(table, '/tmp/table.pq')
parquet_file = pq.ParquetFile('/tmp/table.pq')

# Rebuild the schema from fields with their metadata stripped.
schema = pa.schema(
    [f.remove_metadata() for f in parquet_file.schema_arrow])
print(schema)

This will print:

COL1: int32
COL2: string

Bear in mind that remove_metadata() drops all of a field's metadata. If you start writing your own metadata, you'll want to remove only the entries whose keys start with PARQUET:.