Changing column data type to factor with sparklyr

6k views Asked by At

I am pretty new to Spark and am currently using it using the R API through sparkly package. I created a Spark data frame from hive query. The data types are not specified correctly in the source table and I'm trying to reset the data type by leveraging the functions from dplyr package. Below is the code I tried:

prod_dev <- sdf_load_table(...)
num_var <-  c("var1", "var2"....)
cat_var <-  c("var_a","var_b", ...)

pos1 <- which(colnames(prod_dev) %in% num_var)
pos2 <- which(colnames(prod_dev) %in% cat_var)

prod_model_tbl <- prod_dev %>% 
                mutate(age = 2016- as.numeric(substr(dob_yyyymmdd,1,4))) %>%
                mutate(msa_fg = ifelse(is.na(msacode2000), 0, 1)) %>% 
                mutate(csa_fg = ifelse(is.na(csacode), 0, 1)) %>%
                mutate_each(funs(factor), pos2) %>%
                mutate_each(funs(as.numeric), pos1)

The code will work if prod_dev is a R data frame. But using it on a Spark Data frame does not seem to produce the correct result:

> head(prod_model_tbl)


    Source:   query [?? x 99]
    Database: spark connection master=yarn-client app=sparklyr_test local=FALSE

    Error: org.apache.spark.sql.AnalysisException: undefined function     FACTOR; line 97 pos 2248 at org.apache.spark.sql.hive.HiveFunctionRegistry....

Can someone please advise how to make the desired changes to the Spark Data Frame?

1

There are 1 answers

0
zero323 On BEST ANSWER

In general you can use standard R generic functions for type casting. For example:

df <- data.frame(x=c(1, NA), y=c("-1", "2"))

copy_to(sc, df, "df", overwrite=TRUE) %>% 
  mutate(x_char = as.character(x)) %>% 
  mutate(y_numeric = as.numeric(y))
Source:   query [2 x 4]
Database: spark connection master=...

      x     y x_char y_numeric
  <dbl> <chr>  <chr>     <dbl>
1     1    -1    1.0        -1
2   NaN     2   <NA>         2

The problem is Spark doesn't provide any direct equivalent of R factor.

In Spark SQL we use double type and column metadata to represent categorical variables and ML Transformers, which are not a part of SQL, for encoding. Therefore there is no place for factor / as.factor. SparkR provides some automatic conversions when working with ML but I am not sure if there is similar mechanism in sparklyr (the closest thing I am aware of is ml_create_dummy_variables).