I have a dataset that is partitioned on year and month columns, both derived from a date column. If I apply a filter on the date column, does predicate pushdown come into play? That is, can the PyArrow engine work out that it can skip unnecessary partitions when only the date column is filtered? Or does predicate pushdown apply only when the filter contains the partition columns?
A dataset partitioned by year and month may look like this on disk:
dataset_name/
  year=2023/
    month=01/
      data0.parquet
      data1.parquet
      ...
    month=02/
      data0.parquet
      data1.parquet
      ...
    month=03/
    ...
  year=2022/
    month=01/
    ...
  ...
Example Code
import pyarrow.dataset as ds
dataset = ds.dataset("parquet_dataset_partitioned", format="parquet",
                     partitioning="hive")
# does predicate pushdown apply here?
dataset.to_table(filter=ds.field("date") >= "2023-01-01").to_pandas()
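Yes, but. Predicate pushdown uses partitioning and statistics. So filtering on date should be significantly faster than filtering on other fields, because the data is organized by date and the Parquet min/max statistics let the reader skip row groups that cannot match. However, filtering on a partition field is likely to be an order of magnitude faster still, because whole directories are skipped without opening any files. PyArrow does not infer year or month from a filter on date, so that pruning only happens when the filter names the partition columns.
I recommend experimenting with year, date, and another field to measure the difference. On a similar dataset, I saw a ~4x speedup on the equivalent of date, but a ~50x speedup on year.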