Insert selective columns into pyspark dataframe

39 views Asked by At

I have a dataframe with the following schema :

col1 as col1, col 2, col 3, . . col n

I have another dataframe with the following schema

col1 as diffName, col n+1, col n+2 . . .

how can i append values in col 1 of dataframe and the rest of the columns as null values?

i have tried using union as well as merge but to no avail

2

There are 2 answers

0
Martin Cook On BEST ANSWER

You could manually make the columns of both dataframes match, then union them. For example, with these dataframes:

df1 = spark.createDataFrame([
    (1, 2),
    (3, 4)
], ['a', 'b'])

df2 = spark.createDataFrame([
    (5, 6, 7),
    (8, 9, 10)
], ['a2', 'c', 'd'])

df2 = df2.withColumnRenamed('a2', 'a')

The first step would be to make sure the column that is common to both dataframes has the same name:

df2 = df2.withColumnRenamed('a2', 'a')

Then you can make the columns match:

for c in df2.columns:
    if c not in df1.columns:
        df1 = df1.withColumn(c, lit(None))

for c in df1.columns:
    if c not in df2.columns:
        df2 = df2.withColumn(c, lit(None))

Finally, you can take the union. I find unionByName to be safer:

df_all = df1.unionByName(df2)
2
Samuel Demir On

Maybe You are looking for the unionByName function but with specfiying allowMissingColumns=True keyword.

This function is since spark>=3.1 available. If you are using an older version consider the other answer from Martin Cook.

df1:

enter image description here

df2:

enter image description here

df3:

enter image description here


from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

df1 = spark.createDataFrame(
    data = [Row(name="John", age=27), Row(name="Tony", age=27)],
    schema=StructType([
        StructField(name="name", dataType=StringType(), nullable=True),
        StructField(name="age", dataType=IntegerType(), nullable=True),
]))

df2 = spark.createDataFrame(
    data = [Row(name="John", zip=1234), Row(name="Tony", zip=1235)],
    schema=StructType([
        StructField(name="name", dataType=StringType(), nullable=True),
        StructField(name="zip", dataType=IntegerType(), nullable=True)
]))

df3 = df1.unionByName(df2, allowMissingColumns=True)
df3.display()