PySpark: is the partition column needed in a join when it is already present in where/filter?

from pyspark.sql import functions as F

books_df = (
    spark.table("books")
    .where(F.col("yyyy_mm_dd") == day)
)

audio_df = (
    spark.table("audios")
    .where(F.col("yyyy_mm_dd") == day)
)

# result_df is derived from books_df upstream (not shown in the question)
final_df = result_df.join(audio_df, ["table_id", "yyyy_mm_dd"], "leftsemi")

"yyyy_mm_dd" is the partition for both result_df and audio_df. So in the join, is there any advantage in adding "yyyy_mm_dd"? I think it is not needed as it is already covered in where condition. How do I know the performance difference?
