How does Avro partition pruning work internally?


I have a daily job that converts Avro to Parquet. Each hourly Avro file is about 20 GB, and the data is partitioned by year, month, day, and hour. When I read the Avro files like this, the job runs for 1.5 hours:

```scala
spark.read.format("com.databricks.spark.avro")
  .load(basePath)
  .where($"year" === 2020 && $"month" === 9 && $"day" === 1 && $"hour" === 1)
  .write.partitionBy(partitionCol)
  .parquet(path)
```

Note: the whole basePath folder holds 36 TB of data in Avro format.

But the command below runs in just 7 minutes with the same Spark configuration (memory, number of instances, etc.):

```scala
spark.read.format("com.databricks.spark.avro")
  .option("basePath", basePath)
  .load(basePath + "year=2020/month=9/day=1/hour=1/")
  .write.partitionBy(partitionCol)
  .parquet(path)
```

Why is there such a drastic reduction in time? How does Avro do partition pruning internally?


1 Answer

maxime G:

There is a huge difference.

In the first case you read all the files and then filter; in the second case you read only the selected directory (the filtering is already done by the partition layout, so Spark lists and scans just that one hour of data instead of the whole 36 TB).
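To make the contrast concrete, here is a minimal sketch of the two read patterns, runnable in spark-shell. The basePath value is hypothetical; the partition column names come from the question, and the filter is written as column expressions so Spark at least has a chance to recognize it as a partition filter:

```scala
import spark.implicits._

val basePath = "/data/events/" // hypothetical root, standing in for the 36 TB folder

// Pattern 1: scan from the table root, then filter.
// Spark must first list the whole directory tree under basePath
// before deciding which files are relevant.
val filtered = spark.read.format("com.databricks.spark.avro")
  .load(basePath)
  .where($"year" === 2020 && $"month" === 9 && $"day" === 1 && $"hour" === 1)

// Pattern 2: point the reader at one Hive-style partition directory.
// Only that directory is listed and read; the basePath option keeps
// year/month/day/hour available as columns in the result.
val direct = spark.read.format("com.databricks.spark.avro")
  .option("basePath", basePath)
  .load(basePath + "year=2020/month=9/day=1/hour=1/")
```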

You can check whether the filter is pushed down by calling the explain() function: in the FileScan avro line of the physical plan you will see PushedFilters and PartitionFilters.
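Continuing the sketch above, the check looks like this; the plan excerpt in the comments is only indicative of the shape of the output, not verbatim:

```scala
// Print the physical plan for the root-scan-plus-filter pattern.
filtered.explain()

// In the plan, find the FileScan avro line; it looks roughly like:
//   FileScan avro [...] PartitionFilters: [isnotnull(year#0), (year#0 = 2020), ...],
//                       PushedFilters: []
// If the year/month/day/hour predicates show up under PartitionFilters,
// Spark skips the non-matching directories entirely; if they only appear
// in a separate Filter node above the scan, every file is read first.
```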

In your case, your filter is not pushed down.

You can find more information here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Optimizer-PushDownPredicate.html