I have a PySpark DataFrame with 50k records (dfa) and another with 40k records (dfb). I want to add a new column to dfa that is 'present' for each row that also appears in dfb and 'not_present' otherwise.
I know pandas has syntax for this, but I'm having trouble finding the PySpark equivalent.
Input: dfa
col1 | col2 |
---|---|
xyz | row |
abc | row |
def | row |
Input: dfb
col1 | col2 |
---|---|
xyz | row |
abc | row |
Expected Output:
df3
col1 | col2 | col3 |
---|---|---|
xyz | row | present |
abc | row | present |
def | row | not_present |
Full example: