I am working on a cluster of 4 EC2 r3.2xlarge instances, running Spark 1.3.
val test = clt.rdd.groupBy { r: Row =>
  val key = r.get(0)   // group by the value of the first column
  key
}
clt is a DataFrame loaded from an 8.5 GB CSV file containing about 200 million lines.
On the Spark UI I can see that my groupBy runs over 220 partitions, and "Shuffle spill (memory)" is more than 4 TB. VM options: -Xms80g -Xmx80g
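For reference, the setup looks roughly like this (simplified sketch: the app name, file path and options are placeholders, spark.executor.memory stands in for the -Xms80g/-Xmx80g JVM options, and the spark-csv package is just one way of reading the CSV into a DataFrame on Spark 1.3):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch of the driver setup; 80g mirrors the JVM heap options above.
val conf = new SparkConf().setAppName("clt-groupBy").set("spark.executor.memory", "80g")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Load the 8.5 GB CSV into a DataFrame (path and options are placeholders).
val clt = sqlContext.load(
  "com.databricks.spark.csv",
  Map("path" -> "/data/clt.csv", "header" -> "true"))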
My questions are:
Why is the spill memory so large?
How can I optimize this?
I already tried clt.rdd.repartition(1200) and I get the same result, but this time on the repartition task (the shuffle spill memory is really large and the query is really slow).
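The repartition attempt was essentially the same grouping with a forced partition count, roughly:

import org.apache.spark.sql.Row

// Same grouping key, but repartitioning to 1200 partitions first;
// the spill then shows up on the repartition stage instead.
val test = clt.rdd.repartition(1200).groupBy { r: Row =>
  r.get(0)
}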
EDIT
I found something "weird":
I have a DataFrame named test which contains 5 columns.
This code runs in 5 to 10 minutes:
val test1 = test.rdd.map { row =>
  val a = row.get(0)
  val b = row.get(1)
  val c = row.get(2)
  val d = row.get(3)
  val e = row.get(4)
  (a, Array(a, b, c, d, e))
}.groupByKey
This code runs in 3 to 5 hours (and generates a large amount of shuffle spill memory):
val test1 = test.rdd.map { row =>
  val a = row.get(0)
  (a, row)
}.groupByKey
Any idea why?