I have an sdf (Spark DataFrame) with an id (PK) column and several other columns, which may contain null values. I'd like to find an efficient way to keep only the rows that have at least one value in those columns.
Let's say this is the table:
+-----+------+------+------+
|   id|clm_01|clm_02|clm_03|...
+-----+------+------+------+
|10001|  null|  null|     5|...
|10002|     1|     3|     2|...
|10003|  null|  null|  null|...
...
+-----+------+------+------+
From the table above, I would like to filter out the row with the id 10003, since all of its columns are null. This could easily be done with the script below:
from pyspark.sql.functions import col, when

sdf.withColumn(
    'flg',
    # flag rows in which every value column is null
    when(
        col('clm_01').isNull() & col('clm_02').isNull() & col('clm_03').isNull(), 1
    ).otherwise(0)
).filter(col('flg') != 1)
But how do you apply the condition to many more columns without repeating the isNull() chain a hundred times?
Thanks for your help in advance.
You can use the coalesce, least or greatest functions. They return null if all the columns are null.
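As a minimal sketch of this idea (not necessarily the exact snippet the answer originally showed), the check can be built as a Spark SQL expression string and passed to filter, which accepts such strings. The helper name not_all_null_expr and its func and key parameters are hypothetical. It assumes the value columns share a compatible type, and that least and greatest are given at least two columns.

```python
def not_all_null_expr(columns, func="coalesce", key="id"):
    """Build a Spark SQL predicate that is true when at least one
    non-key column holds a value.

    func may be "coalesce", "least" or "greatest": each of them
    returns null only when every argument is null.
    """
    if func not in ("coalesce", "least", "greatest"):
        raise ValueError(f"unsupported function: {func}")
    # Backticks keep column names with unusual characters valid in SQL.
    value_cols = ", ".join(f"`{c}`" for c in columns if c != key)
    return f"{func}({value_cols}) IS NOT NULL"


# Usage with the DataFrame from the question (sdf is assumed to exist):
#   sdf.filter(not_all_null_expr(sdf.columns))
#   sdf.filter(not_all_null_expr(sdf.columns, func="greatest"))
print(not_all_null_expr(["id", "clm_01", "clm_02", "clm_03"]))
# coalesce(`clm_01`, `clm_02`, `clm_03`) IS NOT NULL
```

Because the column list is taken from sdf.columns, the expression grows with the table and nothing has to be spelled out by hand.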
Alternatively, the same can be done with coalesce alone.
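A possible version with the Column API and coalesce only is sketched here. The function names value_columns and keep_rows_with_any_value and the key parameter are made up for illustration, and the sketch assumes every column other than id is a value column of a compatible type.

```python
def value_columns(columns, key="id"):
    """Return every column name except the primary key."""
    return [c for c in columns if c != key]


def keep_rows_with_any_value(sdf, key="id"):
    """Keep rows where at least one non-key column is not null."""
    # Imported inside the function so that value_columns stays usable
    # in environments without PySpark installed.
    from pyspark.sql import functions as F

    cols = [F.col(c) for c in value_columns(sdf.columns, key)]
    # coalesce returns the first non-null value, or null if all are null.
    return sdf.filter(F.coalesce(*cols).isNotNull())
```

For the three columns shown in the question, keep_rows_with_any_value(sdf) would retain ids 10001 and 10002. Swapping isNotNull() for isNull() selects the all-null rows (id 10003) instead.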