I am running a Spark Streaming application with 2 workers. The application has a join and a union operation.
All the batches are completing successfully, but I noticed that the shuffle spill metrics are not consistent with the input or output data size (the spill memory is more than 20 times larger).
Please find the Spark stage details in the image below:
After researching this, I found that:
Shuffle spill happens when there is not sufficient memory for shuffle data.
Shuffle spill (memory)
- size of the deserialized form of the data in memory at the time of spilling
Shuffle spill (disk)
- size of the serialized form of the data on disk after spilling
Since deserialized data occupies more space than serialized data, Shuffle spill (memory) is larger.
I also noticed that this spill memory size becomes incredibly large with big input data.
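To make the deserialized-vs-serialized gap concrete, here is a rough, self-contained sketch (my own illustration, not taken from the job above). It compares the estimated on-heap size of some sample rows, using Spark's SizeEstimator, with their plain Java-serialized size; the record type and row count are arbitrary placeholders.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

import org.apache.spark.util.SizeEstimator

object SpillSizeSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder rows standing in for shuffle records; real rows are usually
    // richer objects, which tends to widen the gap further.
    val rows: Seq[(Int, String)] = (1 to 100000).map(i => (i, s"value-$i"))

    // Rough deserialized (on-heap) size -- the kind of number behind
    // "Shuffle spill (memory)".
    val deserializedBytes = SizeEstimator.estimate(rows)

    // Java-serialized size -- closer to what "Shuffle spill (disk)" measures.
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(rows)
    out.close()
    val serializedBytes = buffer.size()

    println(s"deserialized ~ $deserializedBytes bytes, serialized ~ $serializedBytes bytes")
  }
}
```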
My queries are:
Does this spilling impact performance considerably?
How can I optimize this spilling, both in memory and on disk?
Are there any Spark properties that can reduce/control this huge spilling?
Learning to performance-tune Spark requires quite a bit of investigation and effort. There are a few good resources, including this video. Spark 1.4 has some better diagnostics and visualisation in the interface, which can help you.
In summary, you spill when the size of the RDD partitions at the end of the stage exceeds the amount of memory available for the shuffle buffer.
You can:
1. repartition() your prior stage so that you have smaller partitions from input.
2. Increase the shuffle buffer by increasing the memory in your executor processes (spark.executor.memory).
3. Increase the shuffle buffer by increasing the fraction of executor memory allocated to it (spark.shuffle.memoryFraction) from the default of 0.2. You need to give back spark.storage.memoryFraction.
4. Increase the shuffle buffer per thread by reducing the ratio of worker threads (SPARK_WORKER_CORES) to executor memory (see the sketch below).
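In case it helps to see these knobs in one place, here is a minimal sketch for a Spark 1.x job, where the legacy memoryFraction settings still apply. The app name, memory size, fractions, input path, and partition count are placeholder values to size against your own cluster, not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleTuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-spill-tuning")
      // (2) more executor memory means a larger shuffle buffer overall
      .set("spark.executor.memory", "6g")
      // (3) give the shuffle buffer a larger fraction of executor memory...
      .set("spark.shuffle.memoryFraction", "0.4")
      // ...and give some back from the storage (cache) fraction
      .set("spark.storage.memoryFraction", "0.4")

    val sc = new SparkContext(conf)

    // (1) repartition the stage feeding the shuffle so each partition fits in
    // memory; the path and partition count are placeholders for your own data.
    val input = sc.textFile("hdfs:///path/to/input")
    val repartitioned = input.repartition(400)
    println(repartitioned.partitions.length)

    // (4) is an environment setting rather than a conf key: lower
    // SPARK_WORKER_CORES in spark-env.sh so fewer concurrent tasks share the
    // same executor memory.

    sc.stop()
  }
}
```

The same keys can also be passed to spark-submit with --conf instead of being hard-coded.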
If there is an expert listening, I would love to know more about how the memoryFraction settings interact and their reasonable range.