PySpark: Is there an efficient way to exclude rows that contain only null values apart from the primary key?


I have a Spark DataFrame (sdf) with an id column (the primary key) and several other columns, which may contain null values. I'd like to find an efficient way to keep only the rows that have at least one non-null value in those columns.

Let's say the table looks like this:

+-----------+-------+-------+-------+
|         id| clm_01| clm_02| clm_03|...
+-----------+-------+-------+-------+
|      10001|   null|   null|      5|...
|      10002|      1|      3|      2|...
|      10003|   null|   null|   null|...
|        ...|    ...|    ...|    ...|
+-----------+-------+-------+-------+

From the table above, I would like to catch the row with id 10003, where every column other than id is null, and drop it. This can easily be done with the script below:

from pyspark.sql.functions import col, when

sdf.withColumn(
    'flg',
    # flag the rows in which every data column is null
    when(col('clm_01').isNull() & col('clm_02').isNull() & col('clm_03').isNull(), 1)
    .otherwise(0)
).filter(col('flg') != 1)

But how do you apply this condition to many more columns without repeating the isNull() chain a hundred times?

Thanks in advance for your help.

1 Answer

Answered by busfighter:

You can use the coalesce, least, or greatest functions: each of them returns null only if all of its input columns are null:

from pyspark.sql import functions as F

# every column except the primary key
columns = list(set(sdf.columns) - {'id'})
# coalesce is null only when all of these columns are null, so this keeps rows like id 10003
sdf = sdf.filter(F.coalesce(*columns).isNull())
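
The greatest (or least) option mentioned above works the same way, since it also returns null only when every input is null; a minimal sketch, assuming the data columns share a comparable type (they are all integers in the example):

from pyspark.sql import functions as F

# greatest(...) is null only when every data column is null,
# so this also keeps rows like id 10003
columns = [c for c in sdf.columns if c != 'id']
only_null_rows = sdf.filter(F.greatest(*columns).isNull())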

or with coalesce alone, comparing the first non-null value in the row to id (note that this relies on column order: it only behaves as intended if id is never null and the data columns come before it; with id as the first column, as in the example, coalesce always returns id and nothing is filtered):

sdf = sdf.filter(F.coalesce(*sdf.columns) == F.col('id'))
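
If, as the question title suggests, the goal is actually to exclude these all-null rows and keep everything else, any of the expressions above can simply be negated; a sketch using the first variant:

from pyspark.sql import functions as F

# keep only the rows that have at least one non-null value outside of id
columns = [c for c in sdf.columns if c != 'id']
sdf_with_values = sdf.filter(F.coalesce(*columns).isNotNull())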