I am trying to understand Spark's HiveContext.
When we write a query using HiveContext like this:
val sqlContext = new HiveContext(sc)
sqlContext.sql("select * from TableA inner join TableB on (a = b)")
is it using the Spark engine or the Hive engine? I believe the above query gets executed by the Spark engine. But if that's the case, why do we need DataFrames?
We could blindly copy all our Hive queries into sqlContext.sql("")
and run them without using DataFrames.
By DataFrames, I mean something like TableA.join(TableB, a === b).
We can even perform aggregations using SQL commands. Could anyone please clarify the concept? Is there any advantage to using DataFrame joins rather than sqlContext.sql() joins?
The join is just an example. :)
HiveContext uses the Spark execution engine underneath; see the Spark code.
Parser support in Spark is pluggable, and HiveContext uses Spark's HiveQL parser.
Functionally, you can do everything with SQL, and DataFrames are not needed. But DataFrames provide a convenient way to achieve the same results: the user doesn't need to write a SQL statement.
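One way to check this yourself is to build the same join both ways and compare the plans Spark prints. The following is only a minimal sketch, assuming the Spark 1.x API, that Hive tables named TableA and TableB exist with join columns a and b (as in the question), and that the master is supplied by spark-submit; the object name and app name are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical driver object; table and column names are taken from the question.
object SqlVsDataFrame {
  def main(args: Array[String]): Unit = {
    // Assumes the master URL is passed in by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("sql-vs-dataframe"))
    val sqlContext = new HiveContext(sc)

    // The join expressed as a SQL string.
    val viaSql = sqlContext.sql(
      "SELECT * FROM TableA INNER JOIN TableB ON (a = b)")

    // The same join expressed through the DataFrame API.
    val tableA = sqlContext.table("TableA")
    val tableB = sqlContext.table("TableB")
    val viaApi = tableA.join(tableB, tableA("a") === tableB("b"), "inner")

    // Print the logical and physical plans for each form so they can be compared.
    viaSql.explain(true)
    viaApi.explain(true)

    sc.stop()
  }
}
```

If both forms go through the same Spark engine, the two physical plans should look essentially alike, which would make the choice mostly one of convenience. A practical difference is that a typo in the DataFrame version tends to fail at compile time or when the expression is built, whereas a typo inside the SQL string only shows up when the query is parsed at runtime.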