I am working on a cluster of 4 EC2 r3.2xlarge instances, running Spark 1.3.
val test = clt.rdd.groupBy { r: Row =>
  val key = r.get(0)   // group by the value of the first column
  key
}
clt is a DataFrame loaded from an 8.5 GB CSV file containing about 200 million lines.
On the Spark UI I can see that my groupBy runs over 220 partitions, and "Shuffle spill (memory)" is more than 4 TB. VM options: -Xms80g -Xmx80g
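For reference, the setup looks roughly like this (simplified sketch: the app name, file path and options are placeholders, spark.executor.memory stands in for the -Xms80g/-Xmx80g JVM options, and the spark-csv package is just one way of reading the CSV into a DataFrame on Spark 1.3):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch of the driver setup; 80g mirrors the JVM heap options above.
val conf = new SparkConf().setAppName("clt-groupBy").set("spark.executor.memory", "80g")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Load the 8.5 GB CSV into a DataFrame (path and options are placeholders).
val clt = sqlContext.load(
  "com.databricks.spark.csv",
  Map("path" -> "/data/clt.csv", "header" -> "true"))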
My questions are:
Why is the spill memory so large?
How can I optimize this?
I already tried clt.rdd.repartition(1200) and I get the same result, but this time on the repartition task (the shuffle spill memory is really large and the query is really slow).
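The repartition attempt was essentially the same grouping with a forced partition count, roughly:

import org.apache.spark.sql.Row

// Same grouping key, but repartitioning to 1200 partitions first;
// the spill then shows up on the repartition stage instead.
val test = clt.rdd.repartition(1200).groupBy { r: Row =>
  r.get(0)
}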
EDIT
I found something "weird":
I have a DataFrame named test which contains 5 columns.
This code runs in 5 to 10 minutes:
val test1 = test.rdd.map { row =>
  val a = row.get(0)
  val b = row.get(1)
  val c = row.get(2)
  val d = row.get(3)
  val e = row.get(4)
  (a, Array(a, b, c, d, e))
}.groupByKey
This code runs in 3 to 5 hours (and generates a large amount of shuffle spill memory):
val test1 = test.rdd.map { row =>
  val a = row.get(0)
  (a, row)
}.groupByKey
Any idea why?