PySpark: is the partition column needed in the join when it is already in the where/filter?

18 views Asked by Blue Clouds At 10 July 2023 at 19:57

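from pyspark.sql import functions as F

# `spark` (SparkSession) and `day` (partition value) are assumed to be defined already.
books_df = (
    spark.table("books")
    .where(F.col("yyyy_mm_dd") == day)
)

audio_df = (
    spark.table("audios")
    .where(F.col("yyyy_mm_dd") == day)
)

# result_df is not defined in this snippet; presumably it is derived from books_df.
final_df = result_df.join(audio_df, ["table_id", "yyyy_mm_dd"], "leftsemi")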

"yyyy_mm_dd" is the partition column for both result_df and audio_df. Is there any advantage to including "yyyy_mm_dd" in the join keys? I think it is not needed, since the where condition already restricts both sides to the same day. How can I measure the performance difference between the two versions?


There are 0 answers
