similar to this question. How can I do the same to write different groups of a DataFrame to different Delta Live Tables? I want something like the following, where I am not limited to a pandas DataFrame, and apply passes either a Spark DataFrame or a SparkSession to the aggregate function:
    def mycustomNotPandaAgg(key, iterator, sparkSession_or_sparkDataframe):
        # option 1: receive a SparkSession and build the group's DataFrame myself
        temp_df = sparkSession.createDataFrame(iterator)  # I can apply a schema here
        temp_df.createOrReplaceTempView("temp_df")
        sparkSession.sql("insert into ... select * from temp_df")  # key is the table name
        # or
        # option 2: receive a Spark DataFrame created internally from each group
        sparkDataframe.write.saveAsTable(key)

    my_df.groupBy("table_name").apply(mycustomNotPandaAgg)
PS: I have already tried the filter approach, where I filter the same DataFrame once per table, get N DataFrames (one for each table), and save them. It is not efficient because the data is skewed per key, and even if I persist the DataFrame before filtering, Spark still launches a separate job for each filter.
One way you could accomplish this without pulling all of the data to the driver is to collect only the distinct keys, then write each filtered DataFrame individually:
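A minimal sketch of that approach (it assumes my_df is the source DataFrame from the question, table_name is the grouping column, and each key should be appended to a Delta table of the same name; the format and write mode are assumptions you would adjust):

    # collect only the distinct keys to the driver, not the data itself
    keys = [row["table_name"] for row in my_df.select("table_name").distinct().collect()]

    # cache the source so the repeated filters below do not rescan it every time
    my_df.persist()

    for key in keys:
        (my_df
            .filter(my_df["table_name"] == key)
            .write
            .format("delta")
            .mode("append")
            .saveAsTable(key))

    my_df.unpersist()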
Note that, unfortunately, you will have to write the output tables serially. You could work around this with multiprocessing, or perhaps another commenter has a more Spark-native way to write groupBy results while still using Spark for parallelization.
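As a sketch of that workaround, a small thread pool on the driver is usually sufficient (rather than multiprocessing), because the heavy lifting runs on the executors and the driver threads mostly block waiting for each job to finish. This reuses keys and my_df from the snippet above; the pool size of 4 is an arbitrary choice:

    from concurrent.futures import ThreadPoolExecutor

    def write_group(key):
        # same per-key write as in the serial loop above
        (my_df
            .filter(my_df["table_name"] == key)
            .write
            .format("delta")
            .mode("append")
            .saveAsTable(key))

    # submit the per-key writes concurrently; each one still runs as a normal Spark job
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(write_group, keys))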