RPostgreSQL - R Connection to Amazon Redshift - How to WRITE/Post Bigger Data Sets


I'm experimenting with how to connect R with Amazon's Redshift - and publishing a short blog for other newbies.

Some good progress: I'm able to do most things (create tables, select data, and even sqlSave or dbSendQuery row by row). HOWEVER, I have not found a way to do a BULK UPLOAD of a table in one shot (e.g. copy the whole 150 x 5 iris table/data frame to Redshift) that doesn't take more than a minute.

Question: Any advice for someone new to RPostgreSQL on how to write/upload a block of data to Redshift would be greatly appreciated!

RODBC:

colnames(iris) <- tolower(colnames(iris)) 
sqlSave(channel,iris,"iris", rownames=F) 

SLOOOOOOW! SO SLOW! There must be a better way - 150 rows take ~1.5 minutes.

iris_results <- sqlQuery(channel,"select * from iris where species = 'virginica'") # fast subset. this does work and shows up on AWS Redshift Dashboard

sqlDrop(channel, "iris", errors = FALSE) # clean up our toys
RPostgreSQL:

dbSendQuery(con, "create table iris_200 (sepallength float,sepalwidth float,petallength float,petalwidth float,species VARCHAR(100));")
dbListFields(con,"iris_200")

ONE BY ONE insert rows into the table:

dbSendQuery(con, "insert into iris_200 values(5.1,3.5,1.4,0.2,'Iris-setosa');")

dbSendQuery(con, "insert into iris_200 values(5.5,2.5,1.1,0.4,'Iris-setosa');")

dbSendQuery(con, "insert into iris_200 values(5.2,3.3,1.2,0.3,'Iris-setosa');")

dframe <-dbReadTable(con,"iris_200") # ok

dbRemoveTable(con,"iris_200")  # and clean up toys

or loop through the table (takes about 1 row per second):

for (i in 1:nrow(iris_200)) {
  query <- paste("insert into iris_200 values(", iris_200[i, 1], ",", iris_200[i, 2], ",",
                 iris_200[i, 3], ",", iris_200[i, 4], ",", "'", iris_200[i, 5], "'", ");", sep = "")
  print(paste("row", i, "loading data >>  ", query))
  dbSendQuery(con, query)
}
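
A somewhat faster variation on this loop - a sketch only, still INSERT-based rather than a true bulk load - is to batch many rows into a single multi-row INSERT so each dbSendQuery call carries, say, 50 rows instead of one (the helper name and batch size below are just placeholders I made up):

batch_insert <- function(con, df, table, batch_size = 50) {
  # one "(num,num,num,num,'text')" fragment per row, built vectorised;
  # assumes the data frame columns are in the same order as the table
  row_values <- paste0("(", df[, 1], ",", df[, 2], ",", df[, 3], ",",
                       df[, 4], ",'", df[, 5], "')")
  # group the row fragments into batches of batch_size
  batches <- split(row_values, ceiling(seq_along(row_values) / batch_size))
  for (b in batches) {
    query <- paste0("insert into ", table, " values ",
                    paste(b, collapse = ","), ";")
    dbSendQuery(con, query)  # one round trip per batch instead of per row
  }
}

batch_insert(con, iris_200, "iris_200")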

So briefly, this is the hacky/slow way - any advice on how to upload/insert bulk data appreciated - thanks!!

Full code here:

PS - got this error message: LOAD source is not supported. (Hint: only S3 or DynamoDB or EMR based load is allowed)


Update 6/12/2015 - Direct load of bulk data at reasonable speed may not be possible, given the error message above and as noted in the LOADING DATA section of this blog: http://dailytechnology.net/2013/08/03/redshift-what-you-need-to-know/

It notes:

So now that we've created our data structure, how do we get data into it? You have two choices: 1) Amazon S3, 2) Amazon DynamoDB. Yes, you could simply run a series of INSERT statements, but that is going to be painfully slow. (!)

Amazon recommends using the S3 method, which I will describe briefly. I don’t see the DynamoDB as particularly useful unless you’re already using that and want to migrate some of your data to Redshift.

To get the data from your local network to S3.....

RA: Will post updates if I figure this out

1 Answer

Answered by Sicarul:

It may be too late for the OP, but I'll post this here for future reference in case someone runs into the same issue:

The steps to do a bulk insert are (a rough code sketch follows the list):

  • Create a table in Redshift with the same structure as my data frame
  • Split the data into N parts
  • Convert the parts into a format readable by Redshift
  • Upload all the parts to Amazon S3
  • Run the COPY statement on Redshift
  • Delete the temporary files on Amazon S3
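
For illustration, a rough manual sketch of those steps - the bucket and file names are placeholders, and it assumes your AWS key pair is in the usual AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables, that the target table already exists, and that `con` is your open Redshift connection:

library(aws.s3)

write.csv(iris_200, "iris_200.csv", row.names = FALSE)         # write a local CSV
put_object(file = "iris_200.csv",                              # upload it to S3
           object = "iris_200.csv", bucket = "mybucket")

copy_sql <- paste0(                                            # run COPY on Redshift
  "copy iris_200 from 's3://mybucket/iris_200.csv' ",
  "credentials 'aws_access_key_id=", Sys.getenv("AWS_ACCESS_KEY_ID"),
  ";aws_secret_access_key=", Sys.getenv("AWS_SECRET_ACCESS_KEY"), "' ",
  "csv ignoreheader 1;")
dbSendQuery(con, copy_sql)

delete_object(object = "iris_200.csv", bucket = "mybucket")    # delete the temp file on S3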

I’ve created an R Package which does exactly this, except for the first step, and it’s called redshiftTools: https://github.com/sicarul/redshiftTools

To install the package, you’ll need to do:

install.packages('devtools')
devtools::install_github("RcppCore/Rcpp")
devtools::install_github("rstats-db/DBI")
devtools::install_github("rstats-db/RPostgres")
devtools::install_github("hadley/xml2")
install.packages("aws.s3", repos = c(getOption("repos"), "http://cloudyr.github.io/drat"))
devtools::install_github("sicarul/redshiftTools")

Afterwards, you’ll be able to use it like this:

library("aws.s3")
library(RPostgres)
library(redshiftTools)

con <- dbConnect(RPostgres::Postgres(), dbname = "dbname",
                 host = 'my-redshift-url.amazon.com', port = '5439',
                 user = 'myuser', password = 'mypassword', sslmode = 'require')

rs_replace_table(my_data, dbcon=con, tableName='mytable', bucket="mybucket")
rs_upsert_table(my_other_data, dbcon=con, tableName = 'mytable', bucket="mybucket", keys=c('id', 'date'))

rs_replace_table truncates the target table and then loads it entirely from the data frame; only do this if you don't care about the data it currently holds. rs_upsert_table, on the other hand, replaces rows with matching keys and inserts those that do not exist in the table.
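
Note that redshiftTools leaves step 1 (creating the target table) to you, so define it first with a plain CREATE TABLE. A minimal sketch, with placeholder column names and types standing in for whatever your data frame actually contains:

# placeholder schema - match it to the columns of my_data
dbSendQuery(con, "create table mytable (
  id    integer,
  date  date,
  value float,
  label varchar(100)
);")

rs_replace_table(my_data, dbcon = con, tableName = 'mytable', bucket = "mybucket")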