We want to collect details of Spark job execution, for instance executorMemory, executorRunTime, executorShuffleTime, etc., which are displayed in the Spark Web UI after a job has completed.
There are a few papers that relate to this and have created datasets of hundreds of GBs. Links below:
However, we couldn't find datasets or snapshots of the data they used.
Where can I find data about Spark job task metrics?
We used sparkMeasure (https://github.com/LucaCanali/sparkMeasure) to generate a dataset on our own, but this is taking too much time. We ran several algorithms such as KMeans, PageRank, sorting, and Linear Regression, and measured the task metrics for each one using sparkMeasure:
    import os
    import sys

    from pyspark.sql import SparkSession
    from sparkmeasure import TaskMetrics

    # Initialize the Spark session.
    spark = SparkSession\
        .builder\
        .appName("PythonPageRank")\
        .getOrCreate()

    # Start collecting task metrics.
    taskmetrics = TaskMetrics(spark)
    taskmetrics.begin()

    # Read the input file named by the first command-line argument.
    lines = spark.read.text(os.path.join("file:///usr/lib/spark/examples/src/main/python", sys.argv[1])).rdd.map(lambda r: r[0])

    # . . . some other code . . .

    # Collects all URL ranks and dump them to console.
    for (link, rank) in ranks.collect():
        print("%s has rank: %s." % (link, rank))

    # End task metrics and write the details to a file (fp is opened elsewhere).
    taskmetrics.end()
    fp.write(taskmetrics.report())
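For reference, the same per-task numbers that the Web UI shows also end up in Spark's JSON event log when event logging is turned on (`spark.eventLog.enabled=true`). Below is a minimal sketch of how such a log could be turned into rows, assuming an uncompressed log with one JSON event per line. The function name `task_metrics_from_event_log` and the `FIELDS` list are hypothetical names made up for this illustration, and the key names should be checked against the logs produced by your own Spark version.

```python
import json
from typing import Dict, Iterable, List, Optional

# Assumed metric names under "Task Metrics" in the event log;
# verify them against a log from your own Spark version.
FIELDS = ["Executor Run Time", "Executor CPU Time", "JVM GC Time", "Result Size"]


def task_metrics_from_event_log(lines: Iterable[str]) -> List[Dict[str, Optional[int]]]:
    """Hypothetical helper: extract one row per finished task from event-log lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        # Only task-end events carry the per-task metrics.
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        metrics = event.get("Task Metrics") or {}
        row = {
            "Stage ID": event.get("Stage ID"),
            "Task ID": event.get("Task Info", {}).get("Task ID"),
        }
        for field in FIELDS:
            # Missing metrics are recorded as None rather than guessed.
            row[field] = metrics.get(field)
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Tiny hand-written sample, not real measurement data.
    sample = [
        '{"Event": "SparkListenerJobStart", "Job ID": 0}',
        '{"Event": "SparkListenerTaskEnd", "Stage ID": 1, '
        '"Task Info": {"Task ID": 7}, '
        '"Task Metrics": {"Executor Run Time": 120, "JVM GC Time": 5}}',
    ]
    for row in task_metrics_from_event_log(sample):
        print(row)
```

With a real log, the function could be called as `task_metrics_from_event_log(open(path))`, where `path` points at one application's event log file. This does not make the jobs themselves run any faster, but it avoids instrumenting each program separately.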
I am unable to generate a large amount of data by running programs individually on Amazon AWS or Google Cloud. I've attached a screenshot of the data that I currently have, which was generated by running the PageRank implementation provided in the Spark examples folder. It would be great if someone could point me towards a similar, exhaustive dataset. Thanks in advance.