What is the best way to benchmark how long Spark takes to read these files?
```scala
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.{Encoder, Encoders}

// Read the raw Avro files of one partition as (path, stream) pairs
val rdd = spark.sparkContext.binaryFiles(s"$Path//$partitionColumn=$partitionId/*.avro")
// Kryo encoder so the pairs can be held in a Dataset
implicit val streamEncoder: Encoder[(String, PortableDataStream)] = Encoders.kryo[(String, PortableDataStream)]
spark.createDataset(rdd)
```
I am using Spark 2.2.
I suggest using this library: https://github.com/LucaCanali/sparkMeasure.
Check the examples available in the README file, such as this Databricks notebook.
For instance, you could time your Avro read using the `runAndMeasure` function:
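Here is a minimal sketch of what that could look like, assuming a `SparkSession` named `spark`, the path variables from your question, and that sparkMeasure is on the classpath (e.g. via `--packages ch.cern.sparkmeasure:spark-measure_2.11:<version>` for Scala 2.11). The trailing `count()` is my own addition: Dataset creation is lazy, so some action is needed to force the files to actually be read inside the measured block.

```scala
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.{Encoder, Encoders}

// Collects stage-level metrics (elapsed time, executor run time, I/O, ...)
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)

implicit val streamEncoder: Encoder[(String, PortableDataStream)] =
  Encoders.kryo[(String, PortableDataStream)]

// runAndMeasure executes the block and prints a metrics report
stageMetrics.runAndMeasure {
  val rdd = spark.sparkContext.binaryFiles(s"$Path//$partitionColumn=$partitionId/*.avro")
  // count() is an assumption: without an action, nothing is read
  spark.createDataset(rdd).count()
}
```

The report printed by `runAndMeasure` includes the elapsed wall-clock time of the stages triggered inside the block, which is what you want for benchmarking the read.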