Error reading csv files larger than 5 GB in R and duckdb

160 views Asked by At

I will load data sets with more than 5Gb per each in duckdb. I need a little help. I start R in the VS Code editor. After some minutes the R stops and gives the message, reload the window. I get an empty example.wal file. The duckbd database is 12 kB big. The output from the dataset is 3 columns with header.

# Add libraries
library(duckdb)
library(dplyr)
library(DBI)

# write to disk as "Example", other defaults to in memory
con <- DBI::dbConnect(duckdb::duckdb(), "Example")
duckdb::duckdb_read_csv(
    conn = con, name = "Example_csv", files = "data/more/Example-2022.csv",
    header = TRUE, delim = ",", na.strings = "NA"
)

DBI::dbListTables(con)

When I use a data set with less data than I get this error message:

Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Error in file "example.csv", on line 3: expected 1 values per row, but got more. (  file=example.csv
  delimiter=','
  quote='"'
  escape='"' (default)
  header=1
  sample_size=20480
  ignore_errors=0
  all_varchar=0)
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 2 appears to contain embedded nulls
3: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 3 appears to contain embedded nulls
4: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 4 appears to contain embedded nulls
5: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 5 appears to contain embedded nulls
6: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  embedded nul(s) found in input
7: Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 
> Error: Invalid Input Error: Error in file "example.csv", on line 3: expected 1 values per row, but got more.

Some lines from the data set:

DateTime,Beta,Alpha
01/02/2022 22:03:13.151,0.83987,0.84129
01/02/2022 22:05:03.942,0.83959,0.84143
01/02/2022 22:05:09.121,0.83982,0.84124
01/02/2022 22:05:09.286,0.83978,0.8412
0

There are 0 answers