I have a dataframe loaded from a Hive table. I make some changes to it and then want to save it back to Hive as a new table. Which method should I use? Assume the dataframe has 70 million records; I want the save to be memory- and time-efficient.
For example, with a dataframe named df:
Operation 1:
df.createOrReplaceTempView("new_table")
spark.sql("create table new_table as select * from new_table")
Operation 2:
df.write.saveAsTable("new_table")
The way I see it, there's no way operation 1 can be more efficient.
createOrReplaceTempView creates a temporary table in memory; you can read about it in this previous question. So between (1) reading from disk to create a temp table in memory and then writing that same table to disk, and (2) reading from disk and writing straight to disk, number 2 seems the obvious favorite.
If this answer doesn't satisfy you, you can always try both ways and compare the total time and memorySeconds consumed in the YARN application UI.
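If you do want to try both ways, something like the following rough sketch could work. It assumes PySpark, an active SparkSession with Hive support, and your dataframe df. The helper names (save_via_temp_view, save_direct, time_it) and the table and view names in the usage comment are made up for illustration. It only measures wall-clock time on the driver; memorySeconds still has to be read from the YARN application UI.

```python
import time
from typing import Callable


def save_via_temp_view(spark, df, table_name: str, view_name: str) -> None:
    """Operation 1: register a temp view, then CREATE TABLE ... AS SELECT from it."""
    df.createOrReplaceTempView(view_name)
    spark.sql(f"CREATE TABLE {table_name} AS SELECT * FROM {view_name}")


def save_direct(df, table_name: str) -> None:
    """Operation 2: write the dataframe straight to a table."""
    df.write.saveAsTable(table_name)


def time_it(label: str, action: Callable[[], None]) -> float:
    """Run action once, print and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f} s")
    return elapsed


# Usage (hypothetical table and view names; needs a real `spark` and `df`):
# time_it("temp view + CTAS", lambda: save_via_temp_view(spark, df, "new_table_ctas", "df_view"))
# time_it("saveAsTable", lambda: save_direct(df, "new_table_direct"))
```

Writing each variant to a different table name keeps the two runs from interfering with each other. A single run per variant is noisy on a shared cluster, so repeat the runs if the numbers are close.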