Best way to benchmark spark reading time


What is the best way to benchmark Spark's reading time?

    import org.apache.spark.input.PortableDataStream
    import org.apache.spark.sql.{Encoder, Encoders}

    // Read the raw Avro files of one partition as (path, stream) pairs;
    // the Kryo encoder lets the pairs be wrapped in a Dataset
    val rdd = spark.sparkContext.binaryFiles(s"$Path/$partitionColumn=$partitionId/*.avro")
    implicit val streamEncoder: Encoder[(String, PortableDataStream)] = Encoders.kryo[(String, PortableDataStream)]
    spark.createDataset(rdd)

I use Spark 2.2.
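Note that createDataset is lazy, so nothing is actually read until an action runs. As a baseline before reaching for a library, here is a minimal manual-timing sketch (the timeRead helper is an illustration, not part of the original code) that forces a full read with count() and measures wall-clock time:

    // Hypothetical helper: run an action that forces the read and time it
    def timeRead[T](label: String)(action: => T): T = {
      val start = System.nanoTime()
      val result = action
      println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
      result
    }

    timeRead("avro read")(spark.createDataset(rdd).count())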


1 Answer

Answered by meniluca:

I suggest using this library: https://github.com/LucaCanali/sparkMeasure.

Check the examples available in the README file, like this Databricks notebook.

For instance, you could measure reading your Avro files with the runAndMeasure function:

    // Create a TaskMetrics instance for the session, then run the
    // action inside runAndMeasure to collect its task metrics
    val taskMetrics = ch.cern.sparkmeasure.TaskMetrics(spark)
    taskMetrics.runAndMeasure(spark.createDataset(rdd).count())
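If you prefer to instrument an arbitrary block yourself, the README also shows a begin/end pattern; a sketch of it (assuming spark and rdd from the question are in scope):

    import ch.cern.sparkmeasure.StageMetrics

    // Collect stage-level metrics around the read, then print a summary
    val stageMetrics = StageMetrics(spark)
    stageMetrics.begin()
    spark.createDataset(rdd).count()  // action that triggers the actual read
    stageMetrics.end()
    stageMetrics.printReport()

On Spark 2.2 you will need a Scala 2.11 build of the library on the classpath, e.g. --packages ch.cern.sparkmeasure:spark-measure_2.11:<version>.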