I am running a Spark job, and it seems that the tasks are not well distributed (see attached). Is there a way to make the tasks more evenly distributed? Thanks!

Looking only at your screenshot, it's quite difficult to diagnose anything. However, there are two things you may want to consider:
The Spark UI (as of 1.3.1, I haven't tried 1.4.0 yet) only shows the sum of stats for finished tasks. If you took this screenshot while your application was running, it's quite possible some tasks were still running and simply hadn't shown up in the stats yet!
On a given Spark stage, you can't have more tasks than data partitions. Without more code it's hard to tell, but you may want to use the rdd.repartition() function; typically you can call rdd.repartition(sparkContext.getConf.getInt("spark.executor.instances", defaultValueInt)) to generate more partitions before processing, and hence smooth the load over the executors, as in the sketch below.
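For example, a minimal sketch of that idea (the input path "myInput.txt", the fallback of 4 executors, and the factor of 3 partitions per executor are placeholders, not values from your job):

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("RepartitionSketch")
        val sc   = new SparkContext(conf)

        // Fall back to 4 if spark.executor.instances is not set (placeholder default)
        val numExecutors = sc.getConf.getInt("spark.executor.instances", 4)

        // "myInput.txt" is a placeholder path
        val rdd = sc.textFile("myInput.txt")

        // A few partitions per executor usually spreads the load more evenly
        // than a handful of large partitions
        val repartitioned = rdd.repartition(numExecutors * 3)

        println(repartitioned.count())
        sc.stop()
      }
    }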
Taking a closer look at the posted image, I can identify two main facts:
This makes me wonder about the nature of your application. Are all the tasks equal, or do some of them need more time to complete than others? If the tasks are heterogeneous, your issue needs to be looked at more carefully. Imagine the following scenario:
Number of tasks: 20, where each one needs 10 seconds to finish except the last one:
Task 01: 10 seconds
Task 02: 10 seconds
Task 03: 10 seconds
Task ...
Task 20: 120 seconds
If we had to evenly distribute the tasks over, say, 4 executors, each executor would have to process 5 tasks in total. Taking into account that one executor is assigned the 20th task, which needs 120 seconds to complete, the execution flow would be the following:
Executor 01 -> tasks completed: 5 -> time: 0:50 minutes
Executor 02 -> tasks completed: 5 -> time: 0:50 minutes
Executor 03 -> tasks completed: 5 -> time: 0:50 minutes
Executor 04 -> tasks completed: 5 -> time: 2:40 minutes
At the end, the user interface would show a result similar to yours, with the number of tasks evenly distributed but not the actual computing time.
Although not the same, a similar thing might be happening in your situation.
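As a rough illustration (a toy sketch only, not your actual job), this is how a single heavy task can produce evenly distributed task counts but very uneven executor times:

    import org.apache.spark.{SparkConf, SparkContext}

    object SkewSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SkewSketch"))

        // 20 partitions -> 20 tasks; partition 19 simulates the slow 20th task
        val rdd = sc.parallelize(1 to 20, 20)

        rdd.mapPartitionsWithIndex { (idx, it) =>
          val workMillis = if (idx == 19) 120000L else 10000L // 120 s vs 10 s
          Thread.sleep(workMillis)
          it
        }.count()

        // In the UI every executor reports the same number of tasks, but the
        // executor that ran partition 19 shows a much larger total task time
        sc.stop()
      }
    }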
I think the tasks are evenly distributed across the different workers, because each task has a different port number in the Address column.