Extremely slow simple SQLite queries from external drive


I have a SQLite database on an external hard drive, and I would like to analyze a couple thousand subsets of it in R (or Python). The data are about 110 million observations of 13 variables, stored (~10.5 GB) on an external drive because I do not have sufficient storage on my computer. I opted to try a SQL database because, while I do not have enough RAM to work with the entire dataset in R, I only need access to a small percentage of it at any given time.

I built the database with RSQLite -- essentially by just appending a bunch of data frames using dbWriteTable() -- and can successfully query it. The problem is that even a very simple query of the database takes minutes to complete. For example, this takes about 3 minutes to run (in either R or with sqlite3 in Python):

# example in R
library(RSQLite)

# connect to existing database
conn <- dbConnect(RSQLite::SQLite(), "EEGdata.db")

# example of simple query
testQuery <- dbSendQuery(conn, "SELECT * FROM Data WHERE Time = 42")

# return data (takes "forever" but does work)
testdf <- dbFetch(testQuery)

# release the result set
dbClearResult(testQuery)

I've tried a number of different (but I think equivalent) ways to execute the query, for instance using dbplyr, but they all similarly take minutes to run.
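
For reference, the dbplyr version was roughly along these lines (a sketch assuming the same conn as above):

# roughly equivalent dbplyr version of the query above
library(dplyr)
library(dbplyr)

testdf <- tbl(conn, "Data") %>%
  filter(Time == 42) %>%
  collect()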

I've tried using transactions (but I gather this mostly just helps for inserting or modifying information?).
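
In case it matters, the transaction attempt was just a sketch like this, wrapping the read with DBI's dbWithTransaction() (as noted, transactions mainly pay off for writes, and this didn't change the timing here):

# wrapping the query in an explicit transaction (no noticeable effect on reads)
testdf <- dbWithTransaction(conn, {
  res <- dbSendQuery(conn, "SELECT * FROM Data WHERE Time = 42")
  out <- dbFetch(res)
  dbClearResult(res)
  out
})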

I've tried using an index, which does speed up the queries, but they still take more than a minute to complete.
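
The indexes were created with plain CREATE INDEX statements through dbExecute(); they match the schema shown further down:

# single-column index on the main filter variable
dbExecute(conn, "CREATE INDEX IF NOT EXISTS time_idx ON Data (Time)")

# composite index covering the most complex filter I use
dbExecute(conn, "CREATE INDEX IF NOT EXISTS idx ON Data (Hstatus, Electrode, Time)")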

I've tried using the latest R/RSQLite versions, as recommended in: SQLite query much slower through RSQLite than sqlite3 command-line interface

I've tested that the general read/write speed of my external drive (WD My Passport) is not horribly slow (it transfers data at about 100 MB/s).

I've tried using sqlite3 through Python also, and get roughly the same very slow queries.

I imagine I could have made the database incorrectly somehow, but it's puzzling why it would still function, just very poorly. Just in case, this is an example of how I created the database:

library(RSQLite)

# open connection to new database
conn <- dbConnect(RSQLite::SQLite(), "EEGdata.db")

# loop over data files
for (file in filenames){

  # load file (column names defined elsewhere)
  temp <- read.delim(file = file, sep = '\t', header = FALSE, col.names = columns)

  # do some cleaning, merging, etc...

  # add to database
  dbWriteTable(conn, "Data", temp, append = TRUE)
}

Since the problem looks similar in both R and Python, I assume there's some issue with how I've created the SQLite database, or with the fact that it is stored on an external drive.

Any ideas on what might be going wrong, or what I could try?

EDITS:

Thanks for all the thoughts!

Regarding indexing, as mentioned by @Waldi and @r2evans, it does help - I did not mean to imply it doesn't, or that I wouldn't use an index. For the most complex query I would make (which still only filters on 3 variables), full indexing cuts the time from ~6 minutes to ~1.5 minutes. So the indices are clearly worth using, but indexing alone doesn't seem to account for the slowness.

Regarding a smaller version of the database (@DavidP), good idea - I just tried a few things and found that, in this case, the actual size of the database doesn't matter much for speed; what matters is the size of the data returned by the query (particularly the number of rows). But even in smaller cases it feels slow - for instance, returning just 900 rows with everything fully indexed takes 1.3 seconds, which from what I can tell is still quite slow for a simple SQL query.

Regarding the speed of the HD (@Waldi and @dnoeth) - I suppose I didn't want to believe this was just a hardware issue, but that's definitely possible. I'll see if I can find an SSD somewhere to test this on and report back.

More information for @Parfait (and others):

sqlite> .schema Data
CREATE TABLE `Data` (
  `Item` INTEGER,
  `Amplitude` REAL,
  `Electrode` INTEGER,
  `Time` INTEGER,
  `Subject` TEXT,
  `Hstatus` TEXT,
  `Word` TEXT,
  `Zipf_freq` REAL,
  `OLD` REAL,
  `Concreteness` REAL,
  `Visual_complexity` REAL,
  `Letter_number` REAL,
  `Morpheme_number` REAL
);
CREATE INDEX time_idx ON Data (Time);
CREATE INDEX idx ON Data (Hstatus, Electrode, Time);


Regarding not using SELECT * - I tried selecting only a single variable instead of all of them, and it did not really improve speed. (And in this application I've already cut out the information I don't need, so apart from the index variables I need to select everything.)
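
The single-variable version I timed was simply something like this (column chosen just as an example):

# selecting one column instead of * (barely changed the runtime)
oneCol <- dbGetQuery(conn, "SELECT Amplitude FROM Data WHERE Time = 42")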

1 Answer

Answered by HCAI:

(Aside from the external HD, which is definitely a bottleneck.) I had a similar issue, and it came down to the SQLite database itself. You can switch to DuckDB and use the same query code. For me it was more than 10x faster on a slightly larger file with 100 million rows. https://duckdb.org/docs/api/r
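
A minimal sketch of that switch with the duckdb R package (the file name here is just an example; you would first load the data into a DuckDB file, e.g. with the same dbWriteTable() loop):

library(DBI)
library(duckdb)

# connect to (or create) a DuckDB database file on disk
con <- dbConnect(duckdb::duckdb(), dbdir = "EEGdata.duckdb")

# the same DBI query code works unchanged
testdf <- dbGetQuery(con, "SELECT * FROM Data WHERE Time = 42")

dbDisconnect(con, shutdown = TRUE)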