I have the following Spark DataFrames:

- df1 with columns (id, name, age)
- df2 with columns (id, salary, city)
- df3 with columns (name, dob)
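For reference, a minimal reproducible setup for these frames could look like this (the rows are invented for illustration; only the column names come from the schemas above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data; only the column names match the real frames
df1 = spark.createDataFrame([(1, 'Alice', 30), (2, 'Bob', 25)],
                            ['id', 'name', 'age'])
df2 = spark.createDataFrame([(1, 50000, 'Paris')],
                            ['id', 'salary', 'city'])
df3 = spark.createDataFrame([('Alice', '1993-01-01')],
                            ['name', 'dob'])
```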
I want to join all of these DataFrames using Python (PySpark). This is the SQL statement I need to replicate:
```sql
select df1.*, df2.salary, df3.dob
from df1
left join df2 on df1.id = df2.id
left join df3 on df1.name = df3.name
```
I tried something like the following in PySpark, but I am receiving an error:
```python
joined_df = df1.join(df2, df1.id=df2.id, 'left')\
    .join(df3, df1.name=df3.name)\
    .select(df1.(*), df2(name), df3(dob))
```
My question: can we join all three DataFrames in one go and select the required columns?
You can leverage `col` and `alias` to get the SQL-like syntax to work. Make sure your DataFrames are aliased first.
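Then something along these lines should work. This is a sketch against the column names from your question; the alias strings ('df1', 'df2', 'df3') are arbitrary labels, and the select list mirrors the SQL (everything from df1, plus salary and dob):

```python
from pyspark.sql.functions import col

# Alias each DataFrame so its columns can be referenced SQL-style
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df3 = df3.alias('df3')

# Chain both left joins in one expression. Note the == (Column
# equality test) where the failed attempt used a single =, which is
# invalid Python in that position.
joined_df = (
    df1.join(df2, col('df1.id') == col('df2.id'), 'left')
       .join(df3, col('df1.name') == col('df3.name'), 'left')
       .select('df1.*', col('df2.salary'), col('df3.dob'))
)
```

Equivalently, if you register the frames as temporary views with `createOrReplaceTempView('df1')` and so on, your original SQL can be run unchanged via `spark.sql(...)`.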