I have a daily job which converts Avro to Parquet.
The Avro data is about 20 GB per hour and is partitioned by year, month, day and hour.
When I read the Avro files like this,
spark.read.format("com.databricks.spark.avro").load(basePath).where($"year" === 2020 && $"month" === 9 && $"day" === 1 && $"hour" === 1).write.partitionBy(partitionCol).parquet(path)
the job runs for 1.5 hours.
Note: the whole folder basePath has 36 TB of data in Avro format.
But the command below runs in just 7 minutes with the same Spark configuration (memory, instances, etc.):
spark.read.format("com.databricks.spark.avro").option("basePath", basePath).load(basePath + "year=2020/month=9/day=1/hour=1/").write.partitionBy(partitionCol).parquet(path)
Why is there such a drastic reduction in time? How does Avro do partition pruning internally?
There is a huge difference.
In the first case you read all the files and then filter; in the second case you read only the selected files (the filtering is already done by the partition layout).
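Here is a minimal sketch of the two read patterns, assuming an existing SparkSession named spark (e.g. in spark-shell) and that basePath, partitionCol and path hold the same values as in the question:

import spark.implicits._

// 1) Load the whole table root: Spark first has to discover every partition
//    directory under basePath before the where() can narrow anything down.
val slowDf = spark.read
  .format("com.databricks.spark.avro")
  .load(basePath)
  .where($"year" === 2020 && $"month" === 9 && $"day" === 1 && $"hour" === 1)

// 2) Load only the single hour directory; the "basePath" option keeps
//    year/month/day/hour available as columns in the result.
val fastDf = spark.read
  .format("com.databricks.spark.avro")
  .option("basePath", basePath)
  .load(basePath + "year=2020/month=9/day=1/hour=1/")

fastDf.write.partitionBy(partitionCol).parquet(path)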
You can check whether the filter is pushed down by using the explain() function. In the FileScan avro node of the plan you will see PushedFilters and PartitionFilters.
In your case, the filter is not pushed down.
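As a rough illustration (the exact plan text varies by Spark version), you can call explain() on the DataFrame from the slow job and look at the FileScan avro line:

slowDf.explain()
// In the physical plan, look at the FileScan avro node:
//   PartitionFilters: [...]  -> predicates used to prune partition directories
//   PushedFilters:    [...]  -> predicates pushed into the file scan itself
// If your year/month/day/hour predicates do not show up in PartitionFilters,
// Spark lists and reads every partition under basePath.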
You can find more information here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Optimizer-PushDownPredicate.html