I am pretty new to Spark and am currently using it through the R API via the sparklyr package. I created a Spark data frame from a Hive query. The data types are not specified correctly in the source table, and I'm trying to reset them by leveraging functions from the dplyr package. Below is the code I tried:
library(sparklyr)
library(dplyr)

prod_dev <- sdf_load_table(...)
num_var <- c("var1", "var2", ...)
cat_var <- c("var_a", "var_b", ...)

pos1 <- which(colnames(prod_dev) %in% num_var)
pos2 <- which(colnames(prod_dev) %in% cat_var)

prod_model_tbl <- prod_dev %>%
  mutate(age = 2016 - as.numeric(substr(dob_yyyymmdd, 1, 4))) %>%
  mutate(msa_fg = ifelse(is.na(msacode2000), 0, 1)) %>%
  mutate(csa_fg = ifelse(is.na(csacode), 0, 1)) %>%
  mutate_each(funs(factor), pos2) %>%
  mutate_each(funs(as.numeric), pos1)
The code works if prod_dev is an R data frame, but running it on a Spark data frame does not produce the correct result:
> head(prod_model_tbl)
Source: query [?? x 99]
Database: spark connection master=yarn-client app=sparklyr_test local=FALSE
Error: org.apache.spark.sql.AnalysisException: undefined function FACTOR; line 97 pos 2248 at org.apache.spark.sql.hive.HiveFunctionRegistry....
Can someone please advise how to make the desired changes to the Spark Data Frame?
In general you can use standard R generic functions for type casting. For example:
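A minimal sketch (sc is assumed to be an existing spark_connect() connection, and df is a toy local frame invented for illustration):

df <- data.frame(x = c(1, NA), y = c("-1", "2"))

copy_to(sc, df, "df_cast_demo", overwrite = TRUE) %>%
  mutate(x_string  = as.character(x),  # double -> string
         y_numeric = as.numeric(y))    # string -> double

Both casts are translated to Spark SQL CAST expressions by dplyr, so they run on the cluster rather than in R.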
The problem is that Spark doesn't provide any direct equivalent of R's factor. In Spark SQL, categorical variables are represented using the double type plus column metadata, and encoded with ML Transformers, which are not part of SQL. Therefore there is no place for factor/as.factor. SparkR provides some automatic conversions when working with ML, but I am not sure whether there is a similar mechanism in sparklyr (the closest thing I am aware of is ml_create_dummy_variables).
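If the end goal is to feed the categorical columns into Spark ML, one workaround (an illustrative sketch, not part of the original answer: ft_string_indexer is sparklyr's binding for Spark's StringIndexer, and var1/var_a are column names taken from the question) is to index string columns instead of calling factor:

prod_model_tbl <- prod_dev %>%
  mutate(var1 = as.numeric(var1)) %>%          # explicit numeric cast
  ft_string_indexer("var_a", "var_a_idx")      # encode category as a numeric index column

The indexer stores the label mapping in the column metadata, which matches how Spark represents categorical variables internally, as described above.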