spark jdbc - multiple connections to source?


Someone mentioned that if we use spark.read with JDBC to generate a DataFrame and afterwards call df.write twice on that DataFrame, Spark opens two connections to the source. **Does it really create two connections?** I need more insight into the inner workings of Spark here.

So let's say I created a function that returns a DataFrame:

def read_df():
    df = (spark.read.format("jdbc")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("url", jdbc_str[0])
          .option("dbtable", tablename)
          .option("user", df_info[0])
          .option("password", df_info[1])
          .option("driver", "com.informix.jdbc.IfxDriver")
          .option("partitionColumn", df_info[2])
          .option("lowerBound", mini)
          .option("upperBound", maxi)
          .option("numPartitions", num_partitions)
          .load())
    return df

Now I take the df returned from the function above and write it to two places:

def write_df_delta(df):
    df.write.format("delta").partitionBy("partitioncolumn").save(location)
    return "successful"

def write_df_csvserde(df):
    df.coalesce(1).write.option("header", "true").mode("append").csv(target_dir)
    return "successful"

Now if I call this in main as below, will that really make two connections to the source? If yes, what is the way to avoid that and read only once? The Spark documentation for load() only says it "Loads data from a data source and returns it as a DataFrame", so I need more context on what is internally at play here.

def main():
    df = read_df()
    status = write_df_delta(df)
    status = write_df_csvserde(df)

if __name__ == '__main__':
    main()
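To build intuition for the question: a Spark DataFrame is not materialized data but a lazy execution plan, and every action re-runs that plan from the source. Here is a toy pure-Python analogy (not Spark code; the LazyDF and jdbc_read names are made up for illustration) showing why two actions mean two source reads:

```python
# Toy analogy only: a "DataFrame" is a lazy plan, and every action
# re-executes that plan from the source. Not real Spark code.

reads = 0  # stands in for the number of JDBC connections opened

def jdbc_read():
    """Pretend JDBC source read."""
    global reads
    reads += 1
    return [1, 2, 3]

class LazyDF:
    """Holds a plan (a zero-arg function), not data."""
    def __init__(self, plan):
        self.plan = plan

    def collect(self):
        # An "action": runs the whole plan from the source.
        return list(self.plan())

df = LazyDF(jdbc_read)   # like spark.read...load(): nothing is read yet
first = df.collect()     # first action  -> one source read
second = df.collect()    # second action -> another source read
print(reads)             # 2
```

The point of the analogy: creating df costs nothing; each of the two "write" actions walks the plan back to the source independently.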

1 Answer

Answer by thebluephantom:

As you have no .cache() or .persist(), it will read from the JDBC source twice: two actions (the two writes) are evident, and each action re-executes the plan from the source.

Caching also has a cost, so weigh it against the price of a second read.
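A minimal sketch of the fix in the question's terms: cache the DataFrame before the two writes, i.e. df = read_df().cache() (or df.persist() with an explicit StorageLevel), and call df.unpersist() when both writes are done. Continuing the pure-Python analogy (made-up LazyDF/jdbc_read names, not Spark code), caching memoizes the first materialization so the second action never touches the source:

```python
# Toy analogy: .cache() marks the plan for reuse, and the first action
# materializes it; later actions read the cached rows. Not Spark code.

reads = 0  # stands in for JDBC connections opened

def jdbc_read():
    """Pretend JDBC source read."""
    global reads
    reads += 1
    return [1, 2, 3]

class LazyDF:
    """A lazy plan with an optional memoized result, like .cache()."""
    def __init__(self, plan):
        self.plan = plan
        self._cached = None
        self._use_cache = False

    def cache(self):
        # Like DataFrame.cache(): only marks the plan for reuse; the
        # data is materialized lazily, on the first action.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self.plan())  # first action reads source
            return self._cached                   # later actions reuse it
        return list(self.plan())

df = LazyDF(jdbc_read).cache()
write_one = df.collect()   # reads the source once and caches the rows
write_two = df.collect()   # served from the cache: no second source read
print(reads)               # 1
```

In the question's main(), the corresponding change is df = read_df().cache() before the two write calls; the trade-off the answer mentions is that caching spends memory or disk to avoid the second JDBC scan.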