How to load objects from S3 bucket into Spark in RStudio?


The object in the S3 bucket is 5.3 GB. To read it into R I used get_object("link to bucket path") from the aws.s3 package, but this pulls the whole object into R's memory and leads to memory issues.

So I installed Spark 2.3.0 in RStudio and am trying to load this object directly into Spark, but I don't know the command to do that. So far I have:

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
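For context, here is a minimal sketch of a local connection configured for S3 access. The hadoop-aws package version, the s3a credential keys, and the use of environment variables are assumptions that may need to be adjusted for your Spark/Hadoop build:

library(sparklyr)

# Assumption: the hadoop-aws version must match the Hadoop build bundled with Spark.
conf <- spark_config()
conf$sparklyr.defaultPackages <- c("org.apache.hadoop:hadoop-aws:2.7.7")

sc <- spark_connect(master = "local", config = conf)

# Pass AWS credentials to the s3a filesystem (read here from environment variables).
hconf <- invoke(spark_context(sc), "hadoopConfiguration")
invoke(hconf, "set", "fs.s3a.access.key", Sys.getenv("AWS_ACCESS_KEY_ID"))
invoke(hconf, "set", "fs.s3a.secret.key", Sys.getenv("AWS_SECRET_ACCESS_KEY"))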

If I could convert the object into a readable data type in R (such as a data.frame/tbl), I would use copy_to to transfer the data from R into Spark, as below:

# Copy data to Spark
spark_tbl <- copy_to(sc, data)

I was wondering: how can I read the object directly into Spark instead?

Relevant links:

  1. https://github.com/cloudyr/aws.s3/issues/170

  2. Sparklyr connection to S3 bucket throwing up error

Any guidance would be sincerely appreciated.


1 Answer

Answered by Abhishek (accepted):

Solution.

I was trying to read a 5.3 GB CSV file from the S3 bucket. Because get_object loads the entire file into R's memory in a single process, it was running into memory issues (IO exceptions).

The solution is to load sparklyr in R (library(sparklyr)) and let Spark read the file, so the work is spread across all cores on the machine.

get_object("link to bucket path") can be replaced with spark_read_csv() pointed at the same bucket path. Since Spark reads the file instead of R, the memory issues go away.
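A minimal sketch of the call (the table name, the bucket path, and the s3a:// scheme are placeholders; sc is the connection created with spark_connect):

spark_tbl <- spark_read_csv(
  sc,
  name = "my_data",                           # hypothetical table name in Spark
  path = "s3a://my-bucket/path/to/file.csv",  # hypothetical bucket path
  memory = FALSE                              # avoid caching the full 5.3 GB in memory
)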

Also, depending on the file format, you can switch to the matching function: spark_load_table, spark_read_jdbc, spark_read_json, spark_read_libsvm, spark_read_parquet, spark_read_source, spark_read_table, spark_read_text, spark_save_table, spark_write_csv, spark_write_jdbc, spark_write_json, spark_write_parquet, spark_write_source, spark_write_table, spark_write_text.
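For example, if the object were stored as Parquet or JSON instead of CSV, the pattern is the same (paths and table names below are placeholders):

parquet_tbl <- spark_read_parquet(sc, name = "my_parquet", path = "s3a://my-bucket/path/to/data.parquet")
json_tbl    <- spark_read_json(sc, name = "my_json", path = "s3a://my-bucket/path/to/data.json")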