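from pyspark.sql import functions as F

# Read only the partition for the given day from each table.
books_df = (
    spark.table("books")
    .where(F.col("yyyy_mm_dd") == day)
)

audio_df = (
    spark.table("audios")
    .where(F.col("yyyy_mm_dd") == day)
)

# Keep only the rows of result_df that have a match in audio_df.
final_df = result_df.join(audio_df, ["table_id", "yyyy_mm_dd"], "leftsemi")
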
"yyyy_mm_dd" is the partition column for both result_df and audio_df. Is there any advantage to also including "yyyy_mm_dd" in the join keys? I think it is not needed, since the where condition already restricts both sides to the same day. How can I measure the performance difference between the two versions?