Pyarrow Dataset: : Does predicate pushdown is applied when filter is applied non-partition colulmns

36 views Asked by At

I have a dataset which is partitioned on year and month columns which are derived from a date column. Now, if I apply a filter on date column, would predicate push comes to picture - would Pyarrow engine able to understand that it could avoid reading unnecessary partitions when a filter on date column is applied. Or is that the predicate push applies only if filters contain partition columns ?

A dataset partitioned by year and month may look like on disk:

dataset_name/
  year=2023/
    month=01/
       data0.parquet
       data1.parquet
       ...
    month=02/
       data0.parquet
       data1.parquet
       ...
    month=03/
    ...
  year=2022/
    month=01/
    ...
  ...

Example Code

import pyarrow.dataset as ds

dataset = ds.dataset("parquet_dataset_partitioned", format="parquet",

                     partitioning="hive")
# does predicate push applies here ?
dataset.to_table(filter=ds.field("date") >= "2023-01-01").to_pandas()
1

There are 1 answers

0
A. Coady On BEST ANSWER

Yes, but. Predicate pushdown uses partitioning and statistics. So filtering on date should be significantly faster than other fields, because the data is organized by date. However, filtering on a partition field is likely to be an order of magnitude faster still.

I recommend experimenting with year, date, and another field to measure the difference. On a similar dataset, I saw a ~4x speedup on the equivalent of date, but a ~50x speedup on year.