Spark tasks don't seem to be well distributed

I am running a Spark job, and it seems that the tasks are not well distributed (see attached). Is there a way to make the tasks more evenly distributed? Thanks!

[Screenshot: Spark UI executor table showing per-executor task counts and running times]

There are 4 answers

KlwntSingh

I think the tasks are evenly distributed across different workers, because each task has a different port number in the Address column.

C4stor

Looking only at your screenshot, it's quite difficult to diagnose anything. However, there are two things you may want to consider:

  • The Spark UI (as of 1.3.1; I haven't tried 1.4.0 yet) only shows summed stats for finished tasks. If you took this screenshot while your application was running, it's quite possible some tasks were still running and simply hadn't shown up in the stats yet!

  • In a given Spark stage, you can't have more tasks than data partitions. Without more code it's hard to tell, but you may want to use the rdd.repartition() function: typically you can call rdd.repartition(sparkContext.getConf.getInt("spark.executor.instances", defaultValueInt)) to generate more partitions before processing, and hence smooth the load over the executors (see the sketch below).
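
For reference, a minimal sketch of that idea (assuming the usual SparkContext sc and some already-loaded RDD rdd; the fallback value of 4 executors is an arbitrary assumption):

    // Read the configured executor count, falling back to 4 (arbitrary default).
    val numExecutors = sc.getConf.getInt("spark.executor.instances", 4)
    // One partition per executor evens the load; a small multiple also works.
    val repartitioned = rdd.repartition(numExecutors)
    println(s"now ${repartitioned.partitions.length} partitions")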

Mikel Urkia

Taking a closer look at the posted image, I can identify two main facts:

  • The number of tasks has been evenly distributed, with a maximum variation of 20 tasks.
  • The running time of each executor differs significantly, from 3.0 min (~80 tasks) to 17.0 min (~60 tasks).

This makes me wonder about the nature of your application. Are all the tasks equal, or do some of them need more time to complete than others? If the tasks are heterogeneous, your issue needs to be looked at more carefully. Imagine the following scenario:

  • Number of tasks: 20, where each one needs 10 seconds to finish, except for the last one:

    Task 01: 10 seconds
    Task 02: 10 seconds
    Task 03: 10 seconds
    Task ...
    Task 20: 120 seconds
    
  • Number of executors: 4 (each with a single core)

If we were to evenly distribute the tasks, each executor would have to process 5 tasks in total. Taking into account that one executor is assigned the 20th task, which needs 120 seconds to complete, the execution flow would be the following:

  • By second 40, each executor would have completed its first 4 tasks, considering that the 20th task is left for the end.
  • By second 50, every executor but one would have finished all of its tasks. The remaining executor would still be computing the 20th task, which would finish 120 seconds later, at second 160 (2:40).

At the end, the user interface would show a result similar to yours, with the number of tasks evenly distributed but not the actual computing time.

Executor 01 -> tasks completed: 5 -> time: 0:50 minutes
Executor 02 -> tasks completed: 5 -> time: 0:50 minutes
Executor 03 -> tasks completed: 5 -> time: 0:50 minutes
Executor 04 -> tasks completed: 5 -> time: 2:40 minutes
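
To make the arithmetic concrete, here is a toy greedy model (plain Scala, not the actual Spark scheduler) that assigns the 20 tasks above to 4 single-core executors, always giving the next task to the executor that becomes free earliest:

    // Tasks 1-19 take 10 seconds each; task 20 takes 120 seconds.
    val taskDurations = Seq.fill(19)(10) :+ 120
    val finishTimes = Array.fill(4)(0)  // when each executor becomes free
    for (d <- taskDurations) {
      val i = finishTimes.indexOf(finishTimes.min)  // earliest-free executor
      finishTimes(i) += d
    }
    finishTimes.zipWithIndex.foreach { case (t, i) =>
      println(f"Executor ${i + 1}%02d -> busy for $t%3d seconds")
    }
    // Prints 50, 50, 50 and 160 seconds, i.e. 0:50 vs 2:40 as above.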

Although not the same, a similar thing might be happening in your situation.

Kshitij Kulshrestha

If you want an equal distribution, you can use Spark's partitioning functionality while loading a file into an RDD:

val ratings = sc.textFile(fileName, minPartitions)

For example, if you have 10 nodes with 2 cores each, you can pass a minPartitions value of 20, and so on.
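
As a concrete sketch (the HDFS path is hypothetical; sc is the usual SparkContext, and the second argument of textFile is a minimum partition count):

    // defaultParallelism reflects the total cores known to the scheduler,
    // e.g. 10 nodes x 2 cores = 20.
    val minPartitions = sc.defaultParallelism
    val ratings = sc.textFile("hdfs:///data/ratings.csv", minPartitions)
    println(s"ratings has ${ratings.partitions.length} partitions")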