Unable to load parquet files with the same column names but a different column order

Scenario:

ABD-MacBook-Pro:ttt abd$ tree
.
├── testing1.paquet
└── testing2.paquet

I have the two parquet files shown above. The column names are the same in both files, but the column order differs (C3 and C4 are swapped). I was able to load these files using Spark. Am I missing something, or is this not supported by pyarrow?

I'm trying to load the parquet files using the command below.

pandas_df = pq.ParquetDataset('ttt', filesystem=file_system).read_pandas().to_pandas()

Running the command above produces the following error:

ValueError: Schema in ttt//testing2.paquet was different.

C1: string
C2: string
C3: string
C4: string
Unnamed: 4: double
Unnamed: 5: double
Unnamed: 6: double
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "C1", "field_name": "C1", "pandas_type": "unicode", "'
            b'numpy_type": "object", "metadata": null}, {"name": "C2", "field_'
            b'name": "C2", "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": null}, {"name": "C3", "field_name": "C3", "pandas_typ'
            b'e": "unicode", "numpy_type": "object", "metadata": null}, {"name'
            b'": "C4", "field_name": "C4", "pandas_type": "unicode", "numpy_ty'
            b'pe": "object", "metadata": null}, {"name": "Unnamed: 4", "field_'
            b'name": "Unnamed: 4", "pandas_type": "float64", "numpy_type": "fl'
            b'oat64", "metadata": null}, {"name": "Unnamed: 5", "field_name": '
            b'"Unnamed: 5", "pandas_type": "float64", "numpy_type": "float64",'
            b' "metadata": null}, {"name": "Unnamed: 6", "field_name": "Unname'
            b'd: 6", "pandas_type": "float64", "numpy_type": "float64", "metad'
            b'ata": null}, {"name": null, "field_name": "__index_level_0__", "'
            b'pandas_type": "int64", "numpy_type": "int64", "metadata": null}]'
            b', "pandas_version": "0.23.0"}'}

vs

C1: string
C2: string
C4: string
C3: string
Unnamed: 4: double
Unnamed: 5: double
Unnamed: 6: double
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "C1", "field_name": "C1", "pandas_type": "unicode", "'
            b'numpy_type": "object", "metadata": null}, {"name": "C2", "field_'
            b'name": "C2", "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": null}, {"name": "C4", "field_name": "C4", "pandas_typ'
            b'e": "unicode", "numpy_type": "object", "metadata": null}, {"name'
            b'": "C3", "field_name": "C3", "pandas_type": "unicode", "numpy_ty'
            b'pe": "object", "metadata": null}, {"name": "Unnamed: 4", "field_'
            b'name": "Unnamed: 4", "pandas_type": "float64", "numpy_type": "fl'
            b'oat64", "metadata": null}, {"name": "Unnamed: 5", "field_name": '
            b'"Unnamed: 5", "pandas_type": "float64", "numpy_type": "float64",'
            b' "metadata": null}, {"name": "Unnamed: 6", "field_name": "Unname'
            b'd: 6", "pandas_type": "float64", "numpy_type": "float64", "metad'
            b'ata": null}, {"name": null, "field_name": "__index_level_0__", "'
            b'pandas_type": "int64", "numpy_type": "int64", "metadata": null}]'
            b', "pandas_version": "0.23.0"}'}

1 Answer

joris

This is not yet supported by pyarrow. More specifically, the current limitation is that the schemas of all pieces / files must be identical, in both column order and column types.

The plan is to improve this situation and add some schema normalization when reading parquet files (see e.g. https://issues.apache.org/jira/browse/ARROW-2659 about different types). The specific case of a different column order is covered by this JIRA issue: https://issues.apache.org/jira/browse/ARROW-2366.