Extracting column name and datatype from parquet file with python

Question

Asked by Vorcry on 07 October 2020 at 00:42 · 3.7k views

I have hundreds of parquet files, and I want to get each column name and its associated data type into a list in Python. I know I can get the schema, which comes in this format:

COL_1: string
   -- field metadata --
   PARQUET:field_id: '34'
COL_2: int32
   -- field metadata --
   PARQUET:field_id: '35'

I just want:

COL_1 string
COL_2 int32

There is 1 answer

**0x26res** · Answer 1 · 2020-10-13T12:34:45+00:00

To convert between Parquet and Arrow (in either direction), some metadata is added to the fields of the schema under the PARQUET key.

You can remove that metadata easily:

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and write it to disk so there is a file to inspect.
table = pa.Table.from_arrays(
    [pa.array([1, 2], type=pa.int32()), pa.array(['foo', 'bar'])],
    schema=pa.schema({'COL1': pa.int32(), 'COL2': pa.string()})
)
pq.write_table(table, '/tmp/table.pq')
parquet_file = pq.ParquetFile('/tmp/table.pq')

# Rebuild the schema from fields with their metadata stripped.
schema = pa.schema(
    [f.remove_metadata() for f in parquet_file.schema_arrow])
print(schema)

This will print:

COL1: int32
COL2: string

Bear in mind that if you start writing your own field metadata, you'll want to remove only the entries under the PARQUET key and keep the rest.
