For a script that I am running, I have a bunch of chained views that look at a specific set of data in SQL (I am using Apache Spark SQL):
%sql create view view_1 as select column_1,column_2 from original_data_table
This logic culminates in a final view, view_n.
However, I then need to perform logic that is difficult (or impossible) to implement in SQL; specifically, exploding a comma-separated column into rows:
%python
from pyspark.sql.functions import explode, split
df_1 = sqlContext.sql("SELECT * FROM view_n")
df1_exploded = df_1.withColumn("exploded_column", explode(split(df_1.col_to_explode, ",")))
Is there a speed cost associated with switching back and forth between SQL tables and PySpark DataFrames? Or, since PySpark DataFrames are lazily evaluated, is it very similar to using a view?
Is there a better way of switching from an SQL table to a PySpark DataFrame?