I have a partition column in my Hive-style partitioned parquet dataset (written by PyArrow from Pandas Dataframe) with an entry like "TYPE=3860877578". When trying to read from this dataset, I get an error:
ArrowInvalid: error parsing '3760212050' as scalar of type int32
This is the first partition key that won't fit into an int32 (i.e. there are smaller integer values in other partitions - I think the inference must be done on the first one encountered). It looks like it should be possible to override the inferred type (to int64 or even string) at the dataset level, but I can't figure out how to get there from here :-) So far I have been using the Pandas.read_parquet() interface, and passing down filters, columns, etc. to PyArrow. I think I will need to use the PyArrow APIs directly, but don't know where to start.
How can I tell PyArrow to treat this column as an int64 or string type instead of trying to infer the type?
Example of dataset partition values that causes this problem:
/mydataset/TYPE=12345/*.parquet
/mydataset/TYPE=3760212050/*.parquet
Code that reproduces the problem with Pandas 1.1.1 and PyArrow 1.0.1:
import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset")
The issue can't be avoided by avoiding the problematic value with filtering because the partition values all appear to be parsed prior to filtering, i.e
import pandas as pd
# pyarrow is available and used
df = pd.read_parquet("mydataset", filters=[[('TYPE','=','12345')]])
I figured out since my original post that I can do what I want with the PyArrow API directly like this:
from pyarrow.dataset import HivePartitioning
from pyarrow.parquet import ParquetDataset
import pyarrow as pa
partitioning = HivePartitioning(pa.schema([("TYPE", pa.int64())]))
df = ParquetDataset("mydataset",
filters=filters,
partitioning=partitioning,
use_legacy_dataset=False).read_pandas().to_pandas()
I'd like to be able to pass that info down through the Pandas read_parquet() interface but it doesn't appear possible at this time.