How does Avro partition pruning work internally?


I have a daily job that converts Avro to Parquet. Each hourly Avro file is about 20 GB, and the data is partitioned by year, month, day, and hour. When I read the Avro files like this, the job runs for 1.5 hours:

```scala
spark.read.format("com.databricks.spark.avro")
  .load(basePath)
  .where($"year" === 2020 && $"month" === 9 && $"day" === 1 && $"hour" === 1)
  .write.partitionBy(partitionCol)
  .parquet(path)
```

Note: the whole basePath folder holds 36 TB of data in Avro format.

But the command below runs in just 7 minutes with the same Spark configuration (memory, number of instances, etc.):

```scala
spark.read.format("com.databricks.spark.avro")
  .option("basePath", basePath)
  .load(basePath + "year=2020/month=9/day=1/hour=1/")
  .write.partitionBy(partitionCol)
  .parquet(path)
```

Why is there such a drastic reduction in time? How does Avro do partition pruning internally?


1 Answer

maxime G:

There is a huge difference.

In the first case you read all the files and then filter; in the second case you read only the selected directory (the filtering is already done by the partition layout, so Spark lists and scans just that one hour of data instead of the whole 36 TB).
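To make the contrast concrete, here is a minimal sketch of the two read patterns, runnable in spark-shell. The basePath value is hypothetical; the partition column names come from the question, and the filter is written as column expressions so Spark at least has a chance to recognize it as a partition filter:

```scala
import spark.implicits._

val basePath = "/data/events/" // hypothetical root, standing in for the 36 TB folder

// Pattern 1: scan from the table root, then filter.
// Spark must first list the whole directory tree under basePath
// before deciding which files are relevant.
val filtered = spark.read.format("com.databricks.spark.avro")
  .load(basePath)
  .where($"year" === 2020 && $"month" === 9 && $"day" === 1 && $"hour" === 1)

// Pattern 2: point the reader at one Hive-style partition directory.
// Only that directory is listed and read; the basePath option keeps
// year/month/day/hour available as columns in the result.
val direct = spark.read.format("com.databricks.spark.avro")
  .option("basePath", basePath)
  .load(basePath + "year=2020/month=9/day=1/hour=1/")
```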

You can check whether the filter is pushed down by calling the explain() function: in the FileScan avro line of the physical plan you will see PushedFilters and PartitionFilters.
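Continuing the sketch above, the check looks like this; the plan excerpt in the comments is only indicative of the shape of the output, not verbatim:

```scala
// Print the physical plan for the root-scan-plus-filter pattern.
filtered.explain()

// In the plan, find the FileScan avro line; it looks roughly like:
//   FileScan avro [...] PartitionFilters: [isnotnull(year#0), (year#0 = 2020), ...],
//                       PushedFilters: []
// If the year/month/day/hour predicates show up under PartitionFilters,
// Spark skips the non-matching directories entirely; if they only appear
// in a separate Filter node above the scan, every file is read first.
```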

In your case, your filter is not pushed down.

You can find more information here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Optimizer-PushDownPredicate.html