How to override type inference for partition columns in Hive partitioned dataset using PyArrow?

666 views Asked by sean_scoggins At 30 September 2020 at 12:33

I have a partition column in my Hive-style partitioned parquet dataset (written by PyArrow from Pandas Dataframe) with an entry like "TYPE=3860877578". When trying to read from this dataset, I get an error:

ArrowInvalid: error parsing '3760212050' as scalar of type int32

This is the first partition key that won't fit into an int32 (i.e. there are smaller integer values in other partitions - I think the inference must be done on the first one encountered). It looks like it should be possible to override the inferred type (to int64 or even string) at the dataset level, but I can't figure out how to get there from here :-) So far I have been using the Pandas.read_parquet() interface, and passing down filters, columns, etc. to PyArrow. I think I will need to use the PyArrow APIs directly, but don't know where to start.

How can I tell PyArrow to treat this column as an int64 or string type instead of trying to infer the type?

Example of dataset partition values that causes this problem:

/mydataset/TYPE=12345/*.parquet
/mydataset/TYPE=3760212050/*.parquet

Code that reproduces the problem with Pandas 1.1.1 and PyArrow 1.0.1:

import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset")

The issue can't be avoided by avoiding the problematic value with filtering because the partition values all appear to be parsed prior to filtering, i.e

import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset", filters=[[('TYPE','=','12345')]])

I figured out since my original post that I can do what I want with the PyArrow API directly like this:

from pyarrow.dataset import HivePartitioning
from pyarrow.parquet import ParquetDataset
import pyarrow as pa
        
partitioning = HivePartitioning(pa.schema([("TYPE", pa.int64())]))

df = ParquetDataset("mydataset",
         filters=filters, 
         partitioning=partitioning,
         use_legacy_dataset=False).read_pandas().to_pandas()

I'd like to be able to pass that info down through the Pandas read_parquet() interface but it doesn't appear possible at this time.

Original Q&A

TechQA.

How to override type inference for partition columns in Hive partitioned dataset using PyArrow?

There are 0 answers

Related Questions in PANDAS

Related Questions in PARQUET

Related Questions in PYARROW

Popular Questions

Popular Tags

Trending Questions