Missing values in Sparklyr


I am trying to count the missing values of a particular column of a DataFrame in sparklyr, like below:

 count(filter(subdata, isNull(subdata$metric)))
Source:   query [1 x 1]
Database: spark connection master=local[4] app=sparklyr local=TRUE

       n
   <dbl>
1 216360

But the result returned is the total number of rows in the DataFrame. Am I missing something? Kindly point it out.


There are 2 answers

Jared Wilber On BEST ANSWER

The following function will count the number of NA values for a given column using sparklyr:

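  count_na_values <- function(column) {
    # Count NA values for a given column using sparklyr.
    # Relies on `df` (a Spark table reference) existing in the calling environment.
    #
    # Args:
    #   column: (char) name of column.
    na_count <- df %>%
      # `!!` unquotes the symbol so the filter refers to the column itself
      filter(is.na(!!rlang::sym(column))) %>%
      sdf_nrow()
    na_count
  }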

Note - df should be of class "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"; e.g.

df <- tbl(sc, <table>)

zero323 On

It looks like you're mixing the SparkR (isNull) and sparklyr (the rest) APIs. As far as I am aware this is not supported, and at first glance your code should actually throw an exception. With sparklyr, use is.na inside filter instead:

df <- data.frame(x=c(1, NA), y=c(-1, 2))
copy_to(sc, df, "df", overwrite=TRUE) %>% filter(is.na(x)) %>% count()
Source:   query [1 x 1]
Database: spark connection ...
      n
  <dbl>
1     1
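
If you need this for more than one column, the same filter-and-count pattern can be wrapped in a small helper that takes the table as an argument rather than relying on a global. This is only a sketch: `count_missing` is a hypothetical name, and it assumes the `.data` pronoun is translated by your dplyr/dbplyr version. It is shown on a local data frame so it can be tried without a Spark connection; passing a Spark table reference instead should behave the same way.

```r
library(dplyr)

# Hypothetical helper (not part of sparklyr): count missing values in one column.
# `tbl` can be a local data frame or a Spark table reference; `column` is a string.
count_missing <- function(tbl, column) {
  tbl %>%
    filter(is.na(.data[[column]])) %>%  # keep only rows where the column is NA
    count() %>%                         # one-row result with a column named n
    pull(n)                             # return the count as a plain number
}

# Local stand-in for the Spark table used in the answer above
local_df <- data.frame(x = c(1, NA), y = c(-1, 2))
count_missing(local_df, "x")
```

Because the column is passed as a string, the helper can be looped over column names, at the cost of one query per column.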