library(sparklyr)
library(dplyr)
library(Lahman)
spark_install(version = "2.0.0")
sc <- spark_connect(master = "local")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting"); batting_tbl
batting_tbl %>% arrange(-index())
# Error: org.apache.spark.sql.AnalysisException: Undefined function: 'INDEX'.
# This function is neither a registered temporary
# function nor a permanent function registered in the database 'default'.; line 3 pos 10
Does anyone know how to use dplyr to sort by row index with a Spark (sparklyr) DataFrame?
This is the best solution I could come up with. It works, but the IDs that sdf_with_unique_id generates are not consecutive: they are unique and increasing, yet beyond roughly row 62,000 they jump to very large values. Regardless, it's one way to create a distributed index column with sparklyr.
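Here is a minimal sketch of that approach, assuming a local Spark connection and the Lahman batting data from the question. The column name row_id is an arbitrary choice for illustration, and I'm assuming the generated IDs follow the order in which the rows were copied, which Spark does not strictly guarantee.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting", overwrite = TRUE)

# Append an ID column whose values are unique and increasing,
# but not necessarily consecutive.
# "row_id" is a name chosen for this example, not a sparklyr default.
batting_indexed <- batting_tbl %>%
  sdf_with_unique_id(id = "row_id")

# Sort by the new column, largest ID first, in place of arrange(-index()).
batting_indexed %>%
  arrange(desc(row_id))
```

The jump in values is expected. According to the sparklyr documentation, sdf_with_unique_id relies on Spark's monotonically increasing ID function, which encodes the partition number in the upper bits of each ID, so the values leap at every partition boundary. If you need gap-free row numbers, newer sparklyr releases also provide sdf_with_sequential_id, though I have not checked it against Spark 2.0.0.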