How to speed up importing to data.frame?


I need to analyse a large dataset (~40 GB, 3 million rows) that is too big to open in a spreadsheet or load directly into R. To address that, I loaded it into an SQLite database, then used R (and RSQLite) to split it into chunks I can manipulate (70,000 rows each). I need it in data.frame format. I used as.data.frame():

library(DBI)
library(dplyr)   # tbl() on a database connection also needs dbplyr installed

#Connecting to the database
con <- dbConnect(drv = RSQLite::SQLite(), dbname = "path")

#Connecting to the table (a lazy reference, not yet in memory)
d <- tbl(con, "Test")

#Filter the table and convert it to a data.frame
d %>% 
   #I filtered using dplyr
   filter("reduce number of rows") %>% 
   as.data.frame()

It works, but it takes a lot of time. How can I make this faster (I have limited RAM)? I also tried setDT(), but it doesn't work on SQLite data:

d %>% setDT()

Error in setDT(.) : 
All elements in argument 'x' to 'setDT' must be of same length, but the profile of input lengths (length:frequency) is: [2:1, 13:1]
The first entry with fewer than 13 entries is 1

1 Answer

G. Grothendieck

To process successive chunks of 70,000 rows using the con object from the question, replace the print statement below with whatever processing you need (base, dplyr, data.table, etc.).

rs <- dbSendQuery(con, "select * from Test")
while(!dbHasCompleted(rs)) {
  dat <- dbFetch(rs, 70000)  # fetch the next chunk of up to 70,000 rows
  print(dim(dat))            # replace with your processing
}
dbClearResult(rs)
dbDisconnect(con)
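
For example, here is a minimal sketch of that pattern in which each chunk is filtered and the retained rows are combined at the end; some_column and the condition are placeholders for your own per-chunk processing, not columns from the actual table:

rs <- dbSendQuery(con, "select * from Test")
pieces <- list()
while(!dbHasCompleted(rs)) {
  dat <- dbFetch(rs, 70000)
  # hypothetical per-chunk processing: keep only the rows of interest
  pieces[[length(pieces) + 1]] <- subset(dat, some_column > 0)
}
dbClearResult(rs)
dbDisconnect(con)

# combine the retained rows into one data.frame that fits in RAM
result <- do.call(rbind, pieces)

This way only one 70,000-row chunk plus the rows you decide to keep are ever held in memory at the same time.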