Preserving the number of partitions of a Spark dataframe after transformation


I am looking at a bug in the code where a dataframe has been split into more partitions than desired (over 700), and this causes too many shuffle operations when I try to repartition it down to only 48. I can't use coalesce() here because I want to have fewer partitions in the first place, before I do a repartition.
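For context, a rough sketch of the two calls I am comparing (with a made-up dataframe standing in for the real one) looks like this:

// Made-up dataframe standing in for mine, just to contrast the two approaches
val wide = spark.range(0, 1000000).repartition(700)   // over-partitioned, like my case

val viaCoalesce    = wide.coalesce(48)      // narrow dependency: merges partitions, no full shuffle
val viaRepartition = wide.repartition(48)   // full shuffle of all the data

println(viaCoalesce.rdd.getNumPartitions)     // 48
println(viaRepartition.rdd.getNumPartitions)  // 48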

I am looking at ways to reduce the number of partitions. Let's say I have a Spark dataframe (with multiple columns) divided into 10 partitions, and I need to do an orderBy transformation based on one of the columns. After this operation is done, will the resulting dataframe have the same number of partitions? If not, how does Spark decide on the number of partitions?

Also, what other transformations could change the number of partitions of a dataframe that I need to be aware of, besides the obvious ones like repartition()?

1 Answer

Alper t. Turker

The number of partitions for operations that require an exchange is defined by spark.sql.shuffle.partitions. If you want a particular value, set it before executing the command:

scala> val df = spark.range(0, 1000)
df: org.apache.spark.sql.Dataset[java.lang.Long] = [id: bigint]

scala> spark.conf.set("spark.sql.shuffle.partitions", 1)

scala> df.orderBy("id").rdd.getNumPartitions
res1: Int = 1

scala> spark.conf.set("spark.sql.shuffle.partitions", 42)

scala> df.orderBy("id").rdd.getNumPartitions
res3: Int = 42
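
The same setting drives other wide transformations as well. A quick sketch, assuming the same spark-shell session and that adaptive query execution is not coalescing the shuffle:

spark.conf.set("spark.sql.shuffle.partitions", 12)

// Anything that requires an exchange picks up the setting:
df.groupBy(($"id" % 10).as("bucket")).count().rdd.getNumPartitions  // 12
df.distinct().rdd.getNumPartitions                                  // 12

Narrow transformations such as select, filter, and withColumn keep the existing partitioning; only operations that introduce an exchange (orderBy, groupBy, distinct, non-broadcast joins, repartition) change the number of partitions.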