I am working with a Spark dataframe (via sparklyr and dplyr) that has the following structure:
lab | x | y
------|---|---
a | 1 | 1
a | 2 | 2
a | 3 | 3
b | 1 | 2
b | 2 | 4
b | 3 | 6
What I need is a linear regression model trained for every partition of lab (with the toy data above, the per-group fits would be y = x for lab a and y = 2x for lab b). Because the original dataframe is quite big, I cannot afford to loop over the partitions and fit all the regressions serially, which would look roughly like the sketch below.
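For reference, this is a minimal sketch of the serial approach I want to avoid, assuming the full data could be collected into a local data.frame called local_df (in my real case it cannot):

# serial baseline (does not scale): one lm() per value of lab,
# assuming local_df holds the full data as a local data.frame
models <- lapply(split(local_df, local_df$lab),
                 function(d) lm(y ~ x, data = d))
preds  <- lapply(models, predict)

Therefore I tried using the spark_apply function instead, as follows: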
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

df <- data.frame(x   = c(1, 2, 3, 1, 2, 3),
                 y   = c(1, 2, 3, 2, 4, 6),
                 lab = c("a", "a", "a", "b", "b", "b"))
sdf_df <- sdf_copy_to(sc, df, overwrite = TRUE)

# fit a regression on one partition and return its predictions
fit_part <- function(df) {
  model  <- ml_linear_regression(df, y ~ x)
  result <- ml_predict(model, select(df, "x"))
  return(result)
}
spark_apply(sdf_df, fit_part, group_by = "lab")
This produces an obscure error starting with:
Error: org.apache.spark.SparkException: Job aborted due to stage failure
Is it even possible to use spark_apply this way? And if not, how would you go about solving this problem?
Thank you!