dbplyr - Error: x and y don't share the same src. Set copy = TRUE to copy y into x's source (this may be time consuming)

11.5k views Asked by At

Normally we do not find any trouble in using the below connection method and run queries from redshift

require("RPostgreSQL")
drv <- dbDriver("PostgreSQL")
conn <- dbConnect(drv, dbname = "redshiftdb",
 host = "XX.XX.XX.XX", port = "1234",
 user = "userid", password = "pwd")

my_data <- dbGetQuery(conn, "select a.*, b.* from redshiftdb.schema1.table1 inner join redshiftdb.schema2.table2 on a.key = b.key")

But the problem with this method is people can use long complex SQL queries which becomes hard to debug and illustrate while re engineering. Unless I am hard core SQL coder.

I have been learning R language since September and I thought it would it would be interesting to use dplyr joins and pipes to do the same work. I connected using

conn <- src_postgres(dbname = "redshiftdb",
 host = "XX.XX.XX.XX", port = 1234,
 user = "userid",
 password = "pwd")
my_tbl1 <- tbl(conn, dplyr::sql('select * from schema1.table1'))
my_tbl2 <- tbl(conn, dplyr::sql('select * from schema1.table2'))
my_tbl3 <- tbl(conn, dplyr::sql('select * from schema1.table3'))
my_tbl4 <- tbl(conn, dplyr::sql('select * from schema1.table4'))
my_tbl5 <- tbl(conn, dplyr::sql('select * from schema2.table1'))
my_tbl6 <- tbl(conn, dplyr::sql('select distinct var1, var2 from schema2.table2'))
my_tbl7 <- tbl(conn, dplyr::sql('select * from schema2.table3'))

I get the above error using left_join and %>% to join across tables in schema1 as well as cross schema (i.e. schema1 & schema2) When I use copy = TRUE it takes time and gives WARNING : only 1,00,000 records copied and it is really time consuming.

I have checked https://github.com/hadley/dplyr/issues/244 but the pool method does not seem to work. Any help would be much appreciated or else learning dplyr would be of no use to serve my immediate purpose.

0

There are 0 answers