Preserving the number of partitions of a Spark dataframe after transformation


I am looking at a bug in the code where a dataframe has been split into more partitions than desired (over 700), and this causes too many shuffle operations when I try to repartition it down to only 48. I can't use coalesce() here because I want to have fewer partitions in the first place, before I do a repartition.
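For context, a rough sketch of the two calls I am comparing (with a made-up dataframe standing in for the real one) looks like this:

// Made-up dataframe standing in for mine, just to contrast the two approaches
val wide = spark.range(0, 1000000).repartition(700)   // over-partitioned, like my case

val viaCoalesce    = wide.coalesce(48)      // narrow dependency: merges partitions, no full shuffle
val viaRepartition = wide.repartition(48)   // full shuffle of all the data

println(viaCoalesce.rdd.getNumPartitions)     // 48
println(viaRepartition.rdd.getNumPartitions)  // 48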

I am looking at ways to reduce the number of partitions. Let's say I have a Spark dataframe (with multiple columns) divided into 10 partitions, and I need to do an orderBy transformation based on one of the columns. After this operation is done, will the resulting dataframe have the same number of partitions? If not, how does Spark decide on the number of partitions?

Also, what other transformations could change the number of partitions of a dataframe that I need to be aware of, besides the obvious ones like repartition()?

1 Answer

Alper t. Turker

The number of partitions for operations that require an exchange is defined by spark.sql.shuffle.partitions. If you want a particular value, set it before executing the command:

scala> val df = spark.range(0, 1000)
df: org.apache.spark.sql.Dataset[java.lang.Long] = [id: bigint]

scala> spark.conf.set("spark.sql.shuffle.partitions", 1)

scala> df.orderBy("id").rdd.getNumPartitions
res1: Int = 1

scala> spark.conf.set("spark.sql.shuffle.partitions", 42)

scala> df.orderBy("id").rdd.getNumPartitions
res3: Int = 42
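
The same setting drives other wide transformations as well. A quick sketch, assuming the same spark-shell session and that adaptive query execution is not coalescing the shuffle:

spark.conf.set("spark.sql.shuffle.partitions", 12)

// Anything that requires an exchange picks up the setting:
df.groupBy(($"id" % 10).as("bucket")).count().rdd.getNumPartitions  // 12
df.distinct().rdd.getNumPartitions                                  // 12

Narrow transformations such as select, filter, and withColumn keep the existing partitioning; only operations that introduce an exchange (orderBy, groupBy, distinct, non-broadcast joins, repartition) change the number of partitions.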