I want to run the following code:
df = df.coalesce(1).orderBy(["my_col"])
but executing it will obviously bottleneck: coalesce(1) collapses the DataFrame to a single partition first, leaving one task to do all the sort work.
I know it's possible to run the following:
df = df.orderBy(["my_col"]).coalesce(1)
however I am uncertain whether Spark will maintain the ordering after the partitions are collapsed. Does it?
If so, the second version would be preferable, since the sort would be performed in a distributed fashion and the results merged afterwards; but I am worried the ordering might not be preserved by that merge.
If it is preserved, this would mean the two are commutative!
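For what it's worth, the ordering can also be spot-checked empirically. Here is a minimal sketch, assuming a local SparkSession and a synthetic my_col of random values in place of the real data; a passing run is only evidence, not a guarantee:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Build a small DataFrame with a random "my_col" to sort on.
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(10000).select(F.rand(seed=42).alias("my_col"))

# Sort first, then collapse to one partition, and check the collected order.
values = [row["my_col"] for row in df.orderBy(["my_col"]).coalesce(1).collect()]
assert values == sorted(values)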
It's easy to see what Spark will actually do in each case by calling explain() on both versions and comparing the physical plans.
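For example, a sketch using a throwaway DataFrame in place of the real df:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(100).withColumnRenamed("id", "my_col")

# Coalesce first: in the physical plan the Sort (and its range-partitioning
# Exchange) sits above Coalesce(1), i.e. the data is funnelled through a
# single partition before being sorted.
df.coalesce(1).orderBy(["my_col"]).explain()

# Sort first: now Coalesce(1) sits above the Sort, i.e. the sort and its
# Exchange run before the partitions are merged into one.
df.orderBy(["my_col"]).coalesce(1).explain()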
So the answer is: they are not commutative. The two plans differ, with the Coalesce and the Sort in opposite positions, so the order in which you apply the two calls matters.