We have a project that currently uses shell scripts and Hive, with Tez as the execution engine.

For a POC, we are trying to convert all Hive queries to Spark SQL code.

One of the clients came back with a question: why would we need a Spark application at all, when we can simply set Spark as Hive's execution engine and keep running our regular shell scripts and Oozie workflows?

If someone has already done such a POC, can you please explain the difference and, more importantly, which approach is faster in terms of performance?

1 Answer


If you already have the business logic implemented in Hive and scheduled in production, the best and safest option is to set Hive's execution engine to Spark and keep using your existing scripts. There should be no performance impact from the query plans themselves: Hive's own optimizer still compiles your queries (Spark's Catalyst is not involved in Hive on Spark), so the logical and physical plans remain the same; only the execution backend changes from Tez to Spark.
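For reference, switching the engine in Hive is a configuration change, not a code rewrite. Assuming the cluster has Hive on Spark installed and configured, something like the following at the top of the existing HQL scripts (or set globally in `hive-site.xml`) is typically all that is needed:

```sql
-- Switch the execution engine for this session (default here is tez).
-- Assumes Hive on Spark is installed and configured on the cluster.
SET hive.execution.engine=spark;

-- Existing HQL queries below this point run unchanged,
-- with Spark as the execution backend instead of Tez.
```

Because only the backend changes, the existing shell scripts and Oozie workflows can keep invoking the same HQL files.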

You can go for building a Spark application in Scala or PySpark if you need programmatic transformations that you can't achieve with HQL alone.