Sum null values using Koalas

What is a good way to count all the Null / NaN values in a dataframe when using Koalas?

Or, stated another way:

How might I return a list of the total null value count for each column? I am trying to avoid converting the dataframe to Spark or pandas if possible.

NOTE: .sum() omits null values in Koalas (skipna: boolean, default True; it can't be changed to False), so running df1.isnull().sum() is out of the question.

numpy was listed as an alternative, but because the dataframe is in Koalas, I observed that .sum() still omitted the NaN values.

Disclaimer: I understand that I can run pandas on Spark, but as I understand it that is counterproductive resource-wise. I hesitate to compute the sum from a Spark or pandas dataframe and then convert the dataframe back into Koalas (again wasting resources, in my opinion). I'm working with a dataset that contains 73 columns and 4 million rows.

There is 1 answer

Bram On

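You can actually use df.isnull() followed by sum(). isnull() returns a dataframe of booleans that marks whether each value is missing, and that result contains no nulls itself, so skipna has nothing to skip. Summing it counts each True as 1, so calling isnull and then sum gives the correct number of missing values per column.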

Example:

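import databricks.koalas as ks

# Pass the names as columns=...; the second positional argument of
# DataFrame is the index, not the column labels.
df = ks.DataFrame([
  [1, 3, 9],
  [2, 3, 7],
  [3, None, 3]
], columns=["c1", "c2", "c3"])

# isnull() gives a boolean dataframe; sum() then counts the True values
# per column, so c2 reports one missing value and c1 and c3 report none.
df.isnull().sum()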
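
To see why skipna does not get in the way, here is a minimal sketch of the same two steps in plain Python, with no Koalas or Spark involved. The `columns` dict and the `is_null` helper are hypothetical stand-ins for the dataframe above and for what isnull() flags; this only illustrates the logic and is not how Koalas implements it.

```python
import math

# Hypothetical stand-in data: the same three columns as the example above,
# held as plain Python lists instead of a Koalas dataframe.
columns = {
    "c1": [1, 2, 3],
    "c2": [3, 3, None],
    "c3": [9, 7, 3],
}


def is_null(value):
    """Return True for None or a float NaN (assumed stand-in for isnull())."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Step 1: build the boolean mask. It holds only True/False, never a null.
mask = {name: [is_null(v) for v in values] for name, values in columns.items()}

# Step 2: sum the mask. True counts as 1 and False as 0, so each total is the
# number of nulls in that column; there is nothing left for skipna to skip.
null_counts = {name: sum(flags) for name, flags in mask.items()}

print(null_counts)  # {'c1': 0, 'c2': 1, 'c3': 0}
```

The point is that the nulls are turned into True before anything is summed, so the sum runs over a fully populated boolean mask.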