How to override type inference for partition columns in Hive partitioned dataset using PyArrow?


I have a partition column in my Hive-style partitioned Parquet dataset (written by PyArrow from a pandas DataFrame) with an entry like "TYPE=3760212050". When trying to read from this dataset, I get an error:

ArrowInvalid: error parsing '3760212050' as scalar of type int32

This is the first partition key that won't fit into an int32 (there are smaller integer values in other partitions, so I assume the type is inferred from the first value encountered). It looks like it should be possible to override the inferred type (to int64 or even string) at the dataset level, but I can't figure out how to get there from here :-) So far I have been using the pandas.read_parquet() interface, passing down filters, columns, etc. to PyArrow. I think I will need to use the PyArrow APIs directly, but don't know where to start.

How can I tell PyArrow to treat this column as an int64 or string type instead of trying to infer the type?

Example of dataset partition values that causes this problem:

/mydataset/TYPE=12345/*.parquet
/mydataset/TYPE=3760212050/*.parquet

Code that reproduces the problem with Pandas 1.1.1 and PyArrow 1.0.1:

import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset")

The issue can't be avoided by filtering out the problematic value, because all partition values appear to be parsed before filters are applied, i.e.:

import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset", filters=[[('TYPE','=','12345')]])

I figured out since my original post that I can do what I want with the PyArrow API directly like this:

from pyarrow.dataset import HivePartitioning
from pyarrow.parquet import ParquetDataset
import pyarrow as pa
        
partitioning = HivePartitioning(pa.schema([("TYPE", pa.int64())]))
filters = [[("TYPE", "=", 12345)]]  # example filter; adjust as needed

df = ParquetDataset("mydataset",
                    filters=filters,
                    partitioning=partitioning,
                    use_legacy_dataset=False).read_pandas().to_pandas()

I'd like to be able to pass that info down through the pandas read_parquet() interface, but it doesn't appear possible at this time.


There are 0 answers