What is a good way to count all null / NaN values in a dataframe when using Koalas?
Or, stated another way:
How might I return a list, by column, of total null value counts? I am trying to avoid converting the dataframe to Spark or pandas if possible.
NOTE: `.sum()` omits null values in Koalas (`skipna`: boolean, default True; can't change to False), so running `df1.isnull().sum()` is out of the question.
numpy was listed as an alternative, but because the dataframe is in Koalas, I observed that `.sum()` was still omitting the NaN values.
Disclaimer: I get that I can run pandas on Spark, but I understand that is counterproductive resource-wise. I hesitate to sum it from a Spark or pandas dataframe and then convert the dataframe into Koalas (again wasting resources, in my opinion). I'm working with a dataset that contains 73 columns and 4 million rows.
You can actually use `df.isnull()`. The reason is that it returns an "array" of booleans indicating whether each value is missing. Those booleans are never null themselves, so `skipna` has nothing to skip. Therefore, if you first call `isnull` and then `sum`, you will get the correct count.
Example:
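A minimal sketch of the idea, using a small made-up frame. It is written against pandas so that it runs without a Spark session. The assumption is that Koalas, which mirrors the pandas API, accepts the same `isnull().sum()` chain on its own DataFrame. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd  # stand-in for Koalas (assumption: same isnull()/sum() calls)

# Hypothetical toy data with missing values in two of three columns.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "z", "w"],
    "c": [1, 2, 3, 4],
})

# isnull() returns a boolean frame: True where a value is missing.
# The boolean frame contains no NaN, so skipna has nothing to omit,
# and sum() simply counts the True values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# The same counts as a plain Python list, in column order.
null_counts_list = null_counts.tolist()
print(null_counts_list)
```

The result has one entry per column (73 for the dataset in the question), so if a plain list is needed, converting only that small result, for example with `to_pandas()` on a Koalas Series, should be cheap compared with converting the full dataframe.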