Does writing the code with the DataFrame API rather than spark.sql queries have any significant advantage?
I would also like to know whether the Catalyst optimizer works on spark.sql queries or not.
Whether you write the code using the DataFrame API or the Spark SQL API, there is no significant difference in terms of performance, because both the DataFrame API and the Spark SQL API are abstractions on top of RDDs (Resilient Distributed Datasets).
The Catalyst Optimizer optimizes structured queries, whether they are expressed in SQL or via the DataFrame/Dataset APIs, which can reduce the runtime of programs and save costs.
So to answer your question: the Catalyst Optimizer works on both Spark SQL queries and the DataFrame/Dataset APIs.
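You can see this for yourself with explain(), which prints the plans Catalyst produces. Here is a minimal PySpark sketch (the events view and category column are made up for illustration); the optimized plans for the SQL and DataFrame versions come out essentially identical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# Hypothetical data: a DataFrame registered as a temp view for SQL access.
df = spark.range(1000).withColumn("category", F.col("id") % 10)
df.createOrReplaceTempView("events")

# The same query expressed two ways.
sql_result = spark.sql("SELECT category, COUNT(*) AS cnt FROM events GROUP BY category")
api_result = df.groupBy("category").agg(F.count("*").alias("cnt"))

# explain(True) prints the parsed, analyzed, optimized, and physical plans;
# compare the optimized plans of the two queries.
sql_result.explain(True)
api_result.explain(True)
```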
If you want a more detailed understanding of the internals and how it all works, you can check out this article, which explains it in depth:
https://unraveldata.com/resources/catalyst-analyst-a-deep-dive-into-sparks-optimizer/
Your DataFrame transformations and Spark SQL queries will be translated into an execution plan either way, and Catalyst will optimize it.
The main advantage of the DataFrame API is that you can call optimization-related methods such as cache() directly, so in general you have more control over the execution plan. I also find it easier to test your code this way, since people writing SQL tend to put everything into one huge query; see the sketch below.
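A minimal PySpark sketch of both points, using made-up order data: cache() marks an intermediate DataFrame for reuse, and splitting the logic into named steps makes each step testable on its own.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-steps").getOrCreate()

# Hypothetical data for illustration.
orders = spark.createDataFrame(
    [(1, "books", 20.0), (2, "books", 35.0), (3, "games", 50.0)],
    ["order_id", "category", "amount"],
)

# Step 1: a small, individually testable transformation.
large_orders = orders.filter(F.col("amount") > 25.0)

# cache() keeps the filtered result in memory, so the two aggregations
# below reuse it instead of recomputing the filter each time.
large_orders.cache()

totals = large_orders.groupBy("category").agg(F.sum("amount").alias("total"))
counts = large_orders.groupBy("category").count()

totals.show()
counts.show()
```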