Joining multiple data frames in one statement and selecting only required columns


I have the following Spark DataFrames:

  • df1 with columns (id, name, age)
  • df2 with columns (id, salary, city)
  • df3 with columns (name, dob)

I want to join all of these DataFrames using PySpark. This is the SQL statement I need to replicate.

SQL:

select df1.*, df2.salary, df3.dob
from df1
left join df2 on df1.id = df2.id
left join df3 on df1.name = df3.name

I tried something like the following in PySpark, but I am receiving an error.

joined_df = df1.join(df2,df1.id=df2.id,'left')\
               .join(df3,df1.name=df3.name)\
               .select(df1.(*),df2(name),df3(dob)

My question: Can we join all three DataFrames in one go and select only the required columns?


There are 2 answers

5
Tanjin

You can leverage col and alias to get the SQL-like syntax to work. Ensure your DataFrames are aliased:

df1 = df1.alias('df1')
df2 = df2.alias('df2')
df3 = df3.alias('df3')

Then the following should work:

from pyspark.sql.functions import col

joined_df = df1.join(df2, col('df1.id') == col('df2.id'), 'left') \
               .join(df3, col('df1.name') == col('df3.name'), 'left') \
               .select('df1.*', 'df2.salary', 'df3.dob')
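
If you prefer not to manage string aliases, the same joins can also be written against the DataFrame objects directly. A minimal sketch of that variant, using made-up sample rows that mirror the schemas in the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data matching the schemas from the question
df1 = spark.createDataFrame([(1, 'Alice', 30), (2, 'Bob', 25)], ['id', 'name', 'age'])
df2 = spark.createDataFrame([(1, 50000, 'NYC')], ['id', 'salary', 'city'])
df3 = spark.createDataFrame([('Alice', '1990-01-01')], ['name', 'dob'])

# Chain both left joins in one statement and select only the needed columns
joined_df = df1.join(df2, df1['id'] == df2['id'], 'left') \
               .join(df3, df1['name'] == df3['name'], 'left') \
               .select(df1['*'], df2['salary'], df3['dob'])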

0

pault

If you have a SQL query that works, why not use pyspark-sql?

First use pyspark.sql.DataFrame.createOrReplaceTempView() to register your DataFrames as temporary tables:

df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
df3.createOrReplaceTempView('df3')

Now you can access these DataFrames as tables with the names you provided in the argument to createOrReplaceTempView(). Use pyspark.sql.SparkSession.sql() to execute your query:

query = "select df1.*, df2.salary, df3.dob " \
        "from df1 " \
        "left join df2 on df1.id=df2.id "\
        "left join df3 on df1.name=df3.name"

joined_df = spark.sql(query)
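
Here spark is the pyspark.sql.SparkSession entry point; in the PySpark shell it already exists, otherwise you can obtain one yourself. A minimal sketch, assuming no session has been created yet:

from pyspark.sql import SparkSession

# getOrCreate() reuses the active session or builds a new one
spark = SparkSession.builder.getOrCreate()

joined_df = spark.sql(query)
joined_df.show()  # inspect the joined result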