Looking to sort a Spark DataFrame by index using sparklyr

library(sparklyr)
library(dplyr)
library(Lahman)

spark_install(version = "2.0.0")
sc <- spark_connect(master = "local")

batting_tbl <- copy_to(sc, Lahman::Batting, "batting"); batting_tbl

batting_tbl %>% arrange(-index())
# Error: org.apache.spark.sql.AnalysisException: Undefined function: 'INDEX'. 
# This function is neither a registered temporary 
# function nor a permanent function registered in the database 'default'.; line 3 pos 10

Does anyone know how to use dplyr to sort by a row index on a Spark (sparklyr) DataFrame? The call above fails because index() has no Spark SQL translation, so it is passed through verbatim as an undefined INDEX function.

Accepted answer (eyeOfTheStorm):

This is the best solution I could come up with. It is correct, but sdf_with_unique_id() returns some very large ID values above roughly row 62,000: the IDs are generated with Spark's monotonically_increasing_id(), so they jump at partition boundaries rather than staying consecutive. Regardless, it's one way to create a distributed index column with sparklyr.

library(sparklyr)
library(dplyr)
library(Lahman)

options(tibble.width = Inf)
options(dplyr.print_max = Inf)

spark_install(version = "2.0.0")
sc <- spark_connect(master = "local")

batting_tbl <- copy_to(sc, Lahman::Batting, "batting"); batting_tbl
tbl_uncache(sc, "batting")  # un-cache the table; it remains registered and queryable

y <- Lahman::Batting  # local copy of the data, kept for comparison (not used below)

# Add a distributed unique-ID column; values jump above ~62,300 rows
# because the IDs are not consecutive across partitions
batting_tbl <- batting_tbl %>% sdf_with_unique_id(id = "id")
batting_tbl %>% arrange(-id)  # sort by the generated index, descending
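
If the gaps in the generated IDs are a problem, sparklyr also provides sdf_with_sequential_id(), which numbers rows consecutively starting at 1. A minimal sketch, assuming that helper is available in your sparklyr version and the connection sc and the registered "batting" table from above are still around:

library(sparklyr)
library(dplyr)

# Reuse the existing connection `sc` and the registered "batting" table
batting_tbl <- tbl(sc, "batting")

# sdf_with_sequential_id() should assign consecutive row numbers (1, 2, 3, ...),
# avoiding the large jumps seen with sdf_with_unique_id()
batting_seq <- batting_tbl %>% sdf_with_sequential_id(id = "row_id")

batting_seq %>% arrange(desc(row_id))  # sort by the index, descending
batting_seq %>% arrange(row_id)        # or ascending

Keep in mind that with either helper the IDs reflect Spark's internal row order, which is not guaranteed to match the order of the original R data frame.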