PySpark: Is there an efficient way to exclude rows that contain only null values apart from the primary key?


I have a Spark DataFrame (sdf) with an id column (the primary key) and several other columns, which may contain null values. I'd like to find an efficient way to keep only the rows that have at least one non-null value in those columns.

Let's say the table looks like this:

+-----------+-------+-------+-------+
|         id| clm_01| clm_02| clm_03|...
+-----------+-------+-------+-------+
|      10001|   null|   null|      5|...
|      10002|      1|      3|      2|...
|      10003|   null|   null|   null|...
|        ...|    ...|    ...|    ...|
+-----------+-------+-------+-------+

From the table above, I would like to catch the row with id 10003, where every column other than id is null, and drop it. This can easily be done with the script below:

from pyspark.sql.functions import col, when

sdf.withColumn(
    'flg',
    # flag the rows in which every data column is null
    when(col('clm_01').isNull() & col('clm_02').isNull() & col('clm_03').isNull(), 1)
    .otherwise(0)
).filter(col('flg') != 1)

But how do you apply this condition to many more columns without repeating the isNull() chain a hundred times?

Thanks in advance for your help.

1 Answer

Answered by busfighter:

You can use the coalesce, least, or greatest functions: each of them returns null only if all of its input columns are null:

from pyspark.sql import functions as F

# every column except the primary key
columns = list(set(sdf.columns) - {'id'})
# coalesce is null only when all of these columns are null, so this keeps rows like id 10003
sdf = sdf.filter(F.coalesce(*columns).isNull())
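
The greatest (or least) option mentioned above works the same way, since it also returns null only when every input is null; a minimal sketch, assuming the data columns share a comparable type (they are all integers in the example):

from pyspark.sql import functions as F

# greatest(...) is null only when every data column is null,
# so this also keeps rows like id 10003
columns = [c for c in sdf.columns if c != 'id']
only_null_rows = sdf.filter(F.greatest(*columns).isNull())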

or with coalesce alone, comparing the first non-null value in the row to id (note that this relies on column order: it only behaves as intended if id is never null and the data columns come before it; with id as the first column, as in the example, coalesce always returns id and nothing is filtered):

sdf = sdf.filter(F.coalesce(*sdf.columns) == F.col('id'))
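
If, as the question title suggests, the goal is actually to exclude these all-null rows and keep everything else, any of the expressions above can simply be negated; a sketch using the first variant:

from pyspark.sql import functions as F

# keep only the rows that have at least one non-null value outside of id
columns = [c for c in sdf.columns if c != 'id']
sdf_with_values = sdf.filter(F.coalesce(*columns).isNotNull())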