What are the roles of the Catalyst optimizer and Project Tungsten?


I am unclear on the roles of Catalyst optimizer and Project Tungsten.

My understanding is that the Catalyst optimizer produces an optimized physical plan from the logical plan. The optimized physical plan is then taken by the code generator to emit RDDs.

Is the Code generator part of Project Tungsten or Catalyst Optimizer? And is the Code generator also called "Whole Stage Code generator"?

Michael Heil

A look at the Databricks glossary or other online resources should clarify your doubts:

Tungsten

"Tungsten is the codename for the umbrella project to make changes to Apache Spark’s execution engine that focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of modern hardware."

Catalyst Optimizer

The Catalyst optimizer takes your code and converts it into an execution plan, which ultimately results in compact generated code for the JVM. It goes through four transformation phases: analysis, logical optimization, physical planning, and code generation.

[Figure: the four phases of the Catalyst optimizer]

Note that the "Code Generation" phase is the fourth phase in the Catalyst Optimizer. More details are in the next section.
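To make the idea of plan rewriting more concrete, here is a minimal toy sketch in plain Python of one rule-based optimization, constant folding, applied to an expression tree. This is not Spark's code: the classes `Lit`, `Col`, `Add` and the function `fold_constants` are hypothetical names invented for illustration, and Catalyst's real rules operate on its own tree types in Scala.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical expression nodes; Catalyst has its own (much richer) tree types.
@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Col, Add]


def fold_constants(expr: Expr) -> Expr:
    """Rewrite the tree bottom-up, replacing Add(Lit, Lit) with a single Lit."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    # Literals and column references are already as simple as they can be.
    return expr


# x + (1 + 2) is rewritten to x + 3 before anything is executed.
plan = Add(Col("x"), Add(Lit(1), Lit(2)))
optimized = fold_constants(plan)
print(optimized)
```

In Spark itself, calling `explain(True)` on a DataFrame prints the parsed, analyzed, and optimized logical plans together with the physical plan, which is a convenient way to watch these phases on a real query.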

WholeStage Code Generator

"Whole-Stage CodeGen is also known as Whole-Stage Java Code Generation, which is a physical query optimization phase in Spark SQL that clubs multiple physical operations together to form a single Java function."