Pyarrow Dataset: : Does predicate pushdown is applied when filter is applied non-partition colulmns

Question

Pyarrow Dataset: : Does predicate pushdown is applied when filter is applied non-partition colulmns

36 views Asked by Scarface At 21 March 2024 at 09:41

I have a dataset which is partitioned on year and month columns which are derived from a date column. Now, if I apply a filter on date column, would predicate push comes to picture - would Pyarrow engine able to understand that it could avoid reading unnecessary partitions when a filter on date column is applied. Or is that the predicate push applies only if filters contain partition columns ?

A dataset partitioned by year and month may look like on disk:

dataset_name/
  year=2023/
    month=01/
       data0.parquet
       data1.parquet
       ...
    month=02/
       data0.parquet
       data1.parquet
       ...
    month=03/
    ...
  year=2022/
    month=01/
    ...
  ...

Example Code

import pyarrow.dataset as ds

dataset = ds.dataset("parquet_dataset_partitioned", format="parquet",

                     partitioning="hive")
# does predicate push applies here ?
dataset.to_table(filter=ds.field("date") >= "2023-01-01").to_pandas()

Original Q&A

There are 1 answers

**A. Coady** · Accepted Answer · 2024-03-22T00:34:03+00:00

Yes, but. Predicate pushdown uses partitioning and statistics. So filtering on date should be significantly faster than other fields, because the data is organized by date. However, filtering on a partition field is likely to be an order of magnitude faster still.

I recommend experimenting with year, date, and another field to measure the difference. On a similar dataset, I saw a ~4x speedup on the equivalent of date, but a ~50x speedup on year.

TechQA.

Pyarrow Dataset: : Does predicate pushdown is applied when filter is applied non-partition colulmns

A dataset partitioned by year and month may look like on disk:

Example Code

There are 1 answers

Related Questions in PYARROW

Popular Questions

Trending Questions