Which database engine for a large dataset?


I'm working on an analysis assignment. We got a partial dataset from the university library containing almost 300,000,000 rows.

Each row contains:

  • ID
  • Date
  • Owner
  • Deadline
  • Checkout_date
  • Checkin_date

I put all of this into a MySQL table and started querying it for my analysis assignment, but even a simple query (SELECT * FROM table WHERE ID = something) was taking 9-10 minutes to complete. So I created an index for all the columns, which made it noticeably faster (~30 seconds).
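For reference, here is a minimal sketch of the kind of single-column index that turns that lookup from a full table scan into an index lookup, assuming MySQL accessed via the mysql-connector-python client; the table name (checkouts), column name, and connection details are placeholders, not taken from the question:

```python
# Minimal sketch, assuming MySQL and the mysql-connector-python package.
# Table/column names and credentials are placeholders.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="student", password="secret", database="library"
)
cur = conn.cursor()

# A single-column index on ID is what turns a full scan over ~300M rows
# (minutes) into a B-tree lookup (well under a second in the typical case).
cur.execute("CREATE INDEX idx_checkouts_id ON checkouts (ID)")

# EXPLAIN shows whether the optimizer actually uses the new index.
cur.execute("EXPLAIN SELECT * FROM checkouts WHERE ID = %s", (12345,))
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```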

So I started reading about similar issues, and people recommended switching to a "wide column store" or a "search engine" instead of a relational database.

So my question is, what would be the best database engine to use for this data?


There are 2 answers

Alain Collins

The answer is, of course, "it depends". In your example, you're looking up records in the database by a given ID. I find it hard to believe that would take 30 seconds in MySQL unless you're on some sluggish laptop.

MySQL has powered an incredible number of systems because it is full-featured, stable, and has pretty good performance. It's bad (or has been bad) at some things, like text search, clustering, etc.

Systems like Elasticsearch are good with globs of text, but still may not be a good fit for your system, depending on usage. From your schema, you have one text field ("owner"), and you wouldn't need Elasticsearch's text-searching capabilities on a field like that (who ever needed to stem a user name?). Elasticsearch is also used widely for log files, which also don't need a text engine. It is, however, good with blocks of text and with clustering.
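To illustrate that point, here is a hedged sketch of what a mapping could look like if you did go the Elasticsearch route, assuming the official elasticsearch-py client (8.x) and a local node; the index name and lowercased field names are invented from the columns in the question. Mapping "owner" as a keyword field deliberately skips the tokenizing and stemming that a user name doesn't need:

```python
# Hedged sketch, assuming elasticsearch-py 8.x and a local cluster.
# Index and field names are assumptions based on the question's columns.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "keyword" means exact-match only: no analysis, no stemming.
es.indices.create(
    index="checkouts",
    mappings={
        "properties": {
            "ID": {"type": "long"},
            "owner": {"type": "keyword"},
            "checkout_date": {"type": "date"},
            "checkin_date": {"type": "date"},
            "deadline": {"type": "date"},
        }
    },
)
```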

If this is a class assignment, I'd stick with MySQL.

dadoonet

Using a search engine to search is IMO the best option.

Elasticsearch of course!

It's built to hold whatever type of data (text, numerics, geopoints, shapes), and it's built for search and computation. It can easily scale out whenever you need to ingest more data.
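As a rough illustration of the "search and computation" claim, here is a sketch assuming the elasticsearch-py 8.x client against a local node; the index name, field names, and values are invented for the example:

```python
# Minimal sketch, assuming elasticsearch-py 8.x and a local node.
# Document shape mirrors the question's columns; values are made up.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Ingest one row; in practice you would bulk-load the ~300M rows with helpers.bulk.
es.index(index="checkouts", document={
    "ID": 12345,
    "owner": "jdoe",
    "checkout_date": "2015-03-01",
    "checkin_date": "2015-03-15",
    "deadline": "2015-03-20",
})

# Exact lookup by ID plus a monthly histogram of checkouts in a single request:
# the search part and the computation (aggregation) part together.
resp = es.search(
    index="checkouts",
    query={"term": {"ID": 12345}},
    aggs={"per_month": {"date_histogram": {"field": "checkout_date",
                                           "calendar_interval": "month"}}},
)
print(resp["hits"]["total"], resp["aggregations"]["per_month"]["buckets"])
```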

On a previous project (in 2011), I was able to transform batch jobs that ran for hours overnight into real-time insights...

It's very common nowadays to see Elasticsearch clusters holding anywhere from a few gigabytes to petabytes of data...

Disclaimer: I work at elastic. :)