The object in the S3 bucket is 5.3 GB in size. To read the object into R as data, I used get_object("link to bucket path"), but this leads to memory issues.
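For context, the failing read looked roughly like the sketch below (assuming get_object comes from the aws.s3 package, as the name suggests; the bucket path is a placeholder):

library(aws.s3)

# get_object pulls the entire 5.3 GB object into R's memory as a single raw
# vector, and parsing it needs yet another in-memory copy -- this is where it fails.
obj <- get_object("s3://my-bucket/path/to/file.csv")
df  <- read.csv(text = rawToChar(obj))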
So I installed Spark 2.3.0 with RStudio and am trying to load the object directly into Spark, but I don't know which command does that.
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")
If I could convert the object into a readable R data type (such as a data.frame/tbl), I would use copy_to to transfer the data from R into Spark, as below:

# Copy data to Spark
spark_tbl <- copy_to(sc, data)
I was wondering: how can I read the object directly into Spark? Any guidance or relevant links would be sincerely appreciated.
Solution.
I was trying to read a 5.3 GB csv file from an S3 bucket. Because get_object pulls the whole object into a single R session's memory at once, it was failing with memory errors (IO exceptions).

The solution is to let Spark do the reading instead: load sparklyr (library(sparklyr)), connect, and replace get_object("link to bucket path") with spark_read_csv(sc, name, "link to bucket path"). Spark reads the file in partitions in its own JVM processes, and R only holds a reference (a tbl_spark) to the result, so the memory issues disappear.
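Putting it together, a minimal sketch (the hadoop-aws version and the s3a path/credential settings are assumptions for a local setup; adjust them to match your Spark/Hadoop build):

library(sparklyr)
library(dplyr)

# Pull in the S3A filesystem connector; the version here is an assumption and
# must match the Hadoop version your Spark distribution was built against.
conf <- spark_config()
conf$sparklyr.defaultPackages <- c("org.apache.hadoop:hadoop-aws:2.7.3")

sc <- spark_connect(master = "local", config = conf)

# Hand AWS credentials to the S3A connector via the Hadoop configuration.
hconf <- invoke(spark_context(sc), "hadoopConfiguration")
invoke(hconf, "set", "fs.s3a.access.key", Sys.getenv("AWS_ACCESS_KEY_ID"))
invoke(hconf, "set", "fs.s3a.secret.key", Sys.getenv("AWS_SECRET_ACCESS_KEY"))

# Spark reads the CSV itself, in partitions; R only keeps a tbl_spark
# reference, so the 5.3 GB never has to fit into R's memory.
spark_tbl <- spark_read_csv(sc, name = "my_data",
                            path = "s3a://my-bucket/path/to/file.csv")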
Also, depending on the file format, you can swap in the matching reader or writer: spark_read_json, spark_read_parquet, spark_read_text, spark_read_libsvm, spark_read_jdbc, spark_read_source, spark_read_table and spark_load_table for reading, and spark_write_csv, spark_write_json, spark_write_parquet, spark_write_text, spark_write_jdbc, spark_write_source, spark_write_table and spark_save_table for writing.
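For example, if the object were Parquet instead of CSV, only the reader call changes (same placeholder path):

# Same connection; only the reader matches the file format.
parquet_tbl <- spark_read_parquet(sc, name = "my_parquet",
                                  path = "s3a://my-bucket/path/to/file.parquet")

# Writing back out follows the same naming pattern.
spark_write_parquet(parquet_tbl, path = "s3a://my-bucket/out/file.parquet")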