Import csv files into duckdb and show the output on the screen

539 views Asked by At

This question has been asked many times, but I have not been able to find a satisfactory solution. I use Linux and R. The dylr package displays some warning messages. When executing this script I get the following warning message:

source("/home/code-server/Workspace/test/new.R", encoding = "UTF-8")

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

warning messages from top-level task callback 'vsc.workspace'
Warning message:
Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 

The source code:

# Add libraries
library(dplyr)
library(DBI)
library(duckdb)


path <- "data/"
setwd(path)

# write to disk as "Example", other defaults to in memory
con <- DBI::dbConnect(duckdb::duckdb(), "test-2022")
duckdb::duckdb_read_csv(
    conn = con, name = "Example_csv", files = "1a.csv",
    header = TRUE, delim = ","
)

Here some lines of Code to show the table. Unfortunately, these lines of code do not work.

con <- dbConnect(duckdb::duckdb(), dbdir = "test-2022", read_only = FALSE)
dbListTables(con)
dbDisconnect(con, shutdown = TRUE)


dbWriteTable(con, "test", test)
res = dbGetQuery(con, "SELECT * FROM "test-2022")
print(res)
2

There are 2 answers

0
tau31 On BEST ANSWER

Regarding your issues:

  1. Warning

Warning message: Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this.

this is happening because the file you source in source("/home/code-server/Workspace/test/new.R", encoding = "UTF-8") does not disconnect from con. As the warning indicates, you add dbDisconnect(con, shutdown = TRUE) at the end of that file.

Because you don't do so, the connection is later garbage collected and produces this warning. You can reproduce this by running the code below in a new session:

con <- dbConnect(duckdb::duckdb())
# call garbage collector 
gc()

you can find more info on the garbage collectionn here (https://adv-r.hadley.nz/names-values.html#gc)

  1. Regarding the code that doesn't work
con <- dbConnect(duckdb::duckdb(), dbdir = "test-2022", read_only = FALSE)
dbListTables(con)
dbDisconnect(con, shutdown = TRUE)


dbWriteTable(con, "test", test)
res = dbGetQuery(con, "SELECT * FROM "test-2022")
print(res)

There are at least to errors here:

  • You close the connection con (using dbDisconnect) but then you attempt to write the table test to the same connection (using dbWriteTable). This will not work because the connection has been disconnected.

  • As already mentioned above, the query in dbGetQuery is not properly constructed since you hare using nested double quotes. Moreoved, you are trying to import data from a table with the name of the database file when you set it up in dbConnect above instead of the importing data from test

At least, that is what I think you are doing, since you dind't share that many details on the error message.

This is what I think the code should look like:

con <- dbConnect(duckdb::duckdb(), dbdir = "test-2022", read_only = FALSE)
dbListTables(con)

dbWriteTable(con, "test", test)
res = dbGetQuery(con, "SELECT * FROM test") # or any other table you want results from

print(res)

dbDisconnect(con, shutdown = TRUE)

I hope it was helpful :)

3
r2evans On

Up front, I'm assuming your "does not work" is due to this error:

res = dbGetQuery(con, "SELECT * FROM "test-2022")
# Error: unexpected symbol in "res = dbGetQuery(con, "SELECT * FROM "test"

(If not, you will need to give the actual error text you see.)

You cannot have un-escaped double quotes inside a double-quote string, it will be a parsing error. The solution is often to use singles for one (outer/inner) and doubles for the other, or to backslash-escape the inner quotes (when the inner are the same as the outer).

While R tends to use single- and double-quotes interchangeably for strings, SQL does not, it needs quoted identifiers (e.g., table/column names) to be enclosed in double-quotes, so you can either backslash-escape them, or just use single-quotes for the outer ones.

Try either (or both?) of the below:

# double inside single
res <- dbGetQuery(con, 'SELECT * FROM "test-2022" ')
# escaped-double inside double
res <- dbGetQuery(con, "SELECT * FROM \"test-2022\" ")
# R's new-ish raw-strings
res <- dbGetQuery(con, r"(SELECT * FROM "test-2022" )")

(The trailing space between 22" and the closing single/double quote is purely to clearly separate them in this interface, it is not required. It will work just as easily with 22"', 22\"", and 22")", respectively.)


N.B.: if you've not heard of R's "raw strings" before, here's one reference: https://r4ds.hadley.nz/strings.html#sec-raw-strings, and the original announcement in R NEWS under "CHANGES IN R 4.0.0 >> SIGNIFICANT USER-VISIBLE CHANGES":

There is a new syntax for specifying raw character constants similar to the one used in C++: r"(...)" with ... any character sequence not containing the sequence ⁠)". This makes it easier to write strings that contain backslashes or both single and double quotes. For more details see ?Quotes.

Sure enough, in ?Quotes, we can see

     Raw character constants are also available using a syntax similar
     to the one used in C++: ‘r"(...)"’ with ‘...’ any character
     sequence, except that it must not contain the closing sequence
     ‘)"’.  The delimiter pairs ‘[]’ and ‘{}’ can also be used, and ‘R’
     can be used in place of ‘r’.  For additional flexibility, a number
     of dashes can be placed between the opening quote and the opening
     delimiter, as long as the same number of dashes appear between the
     closing delimiter and the closing quote.