Normally I have no trouble using the connection method below to run queries against Redshift:
require("RPostgreSQL")
drv <- dbDriver("PostgreSQL")
conn <- dbConnect(drv, dbname = "redshiftdb",
                  host = "XX.XX.XX.XX", port = "1234",
                  user = "userid", password = "pwd")
my_data <- dbGetQuery(conn, "select a.*, b.*
                             from redshiftdb.schema1.table1 a
                             inner join redshiftdb.schema2.table2 b
                             on a.key = b.key")
The problem with this method is that people can write long, complex SQL queries that become hard to debug and re-engineer unless you are a hardcore SQL coder.
I have been learning R since September, and I thought it would be interesting to use dplyr joins and pipes to do the same work. I connected using:
conn <- src_postgres(dbname = "redshiftdb",
host = "XX.XX.XX.XX", port = 1234,
user = "userid",
password = "pwd")
my_tbl1 <- tbl(conn, dplyr::sql('select * from schema1.table1'))
my_tbl2 <- tbl(conn, dplyr::sql('select * from schema1.table2'))
my_tbl3 <- tbl(conn, dplyr::sql('select * from schema1.table3'))
my_tbl4 <- tbl(conn, dplyr::sql('select * from schema1.table4'))
my_tbl5 <- tbl(conn, dplyr::sql('select * from schema2.table1'))
my_tbl6 <- tbl(conn, dplyr::sql('select distinct var1, var2 from schema2.table2'))
my_tbl7 <- tbl(conn, dplyr::sql('select * from schema2.table3'))
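This is roughly what the failing join looks like (a sketch; "key" is my placeholder for the actual join column). Since both tbls come from the same connection, I expected dplyr to translate the join to SQL and run it inside Redshift:

```r
# Join two lazy tbls from the same connection; "key" stands in for the
# real join column. Nothing should be pulled into R until collect().
joined <- my_tbl1 %>%
  left_join(my_tbl2, by = "key")

joined %>% show_query()        # inspect the SQL dplyr generates
result <- joined %>% collect() # pull the joined rows into R
```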
I get an error when using left_join and %>% to join tables within schema1 as well as across schemas (i.e. schema1 & schema2). When I use copy = TRUE it is really time consuming and gives a WARNING that only 100,000 records were copied.
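The only way I can get a cross-schema join to run at all is to push the whole join back into raw SQL inside tbl(), which defeats the purpose of using dplyr in the first place (again, "key" is a placeholder for the real join column):

```r
# Workaround sketch: define a single lazy tbl over the already-joined SQL,
# so Redshift performs the cross-schema join and no copying is needed.
cross_tbl <- tbl(conn, dplyr::sql('
  select a.*, b.var1, b.var2
  from schema1.table1 a
  inner join schema2.table2 b on a.key = b.key'))
```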
I have checked https://github.com/hadley/dplyr/issues/244, but the pool method does not seem to work. Any help would be much appreciated; otherwise learning dplyr will not serve my immediate purpose.