I have a dataframe and need to see if it contains null values. There are plenty of posts on the same topic but nearly all of them use the count action or the show method.

count operations are prohibitively expensive in my case as the data volume is large. Same for the show method.

Is there a way in which I can ask Spark to look for null values and raise an error as soon as it encounters the first one?

The solutions in other posts give the count of missing values in each column. I don't need to know the number of missing values in every column. I just want to know if there is any cell in the dataframe with a null value.

3 Answers

ollik1 On Best Solutions

You can use limit for that: filter down to the rows that contain a null and take at most one, so Spark can stop scanning as soon as the first match is found.

Anupam Kumar On

As I understand it, your requirement is just to raise a flag if any column has a null. You don't need to know which rows actually contain the nulls.

Solution: The easiest approach I can think of is creating a temp view of your DataFrame and checking for nulls across all the columns. Here is the pseudocode for that:

df.createOrReplaceTempView("tempView")
null_count = sqlContext.sql(
    "SELECT count(*) AS cnt FROM tempView "
    "WHERE Col1 IS NULL OR Col2 IS NULL OR Col3 IS NULL"
).collect()[0]["cnt"]

if null_count > 0:
    null_flag = True

Now use the flag as you want.

Regards, Anupam

Shaido On

You have to potentially go through all values and check for null values. This can be done by either traversing the dataframe in a column-wise or row-wise fashion. Which one is best depends on the data (use heuristics).

Row-wise traversal:

import pyspark.sql.functions as f
from functools import reduce

any_null = reduce(lambda x, y: x | y, (f.col(c).isNull() for c in df.columns))
contains_nulls = len(df.where(any_null).limit(1).head(1)) > 0

Column-wise traversal (empirically this should be faster, see comment by Clock Slave):

import pyspark.sql.functions as f

contains_nulls = False
for c in df.columns:
  if df.where(f.col(c).isNull()).limit(1).head(1):
    contains_nulls = True
    break

limit(1) is used to stop as soon as the first null value is found, and head(1) returns an empty list in pyspark when no row matched, so you can tell whether the filtered dataframe is empty without a full count.