Fastest way to backup / delete from table


I am trying to back up some data from a table with > 50M rows. The table looks like this:

# 
entry_id             : int unsigned auto_increment  # unique entry id
---
time=CURRENT_TIMESTAMP : timestamp                  # current timestamp
room                 : varchar(255)                 # Room
id                   : varchar(255)                 # Sensor ID / NAME
value                : double                       # sensor value
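
For completeness, this definition lives in a DataJoint table declared roughly as follows (the schema and class names below are made up):

import datajoint as dj

schema = dj.schema('sensors')   # placeholder schema name

@schema
class Reading(dj.Manual):       # placeholder class name
    definition = """
    #
    entry_id             : int unsigned auto_increment  # unique entry id
    ---
    time=CURRENT_TIMESTAMP : timestamp                  # current timestamp
    room                 : varchar(255)                 # Room
    id                   : varchar(255)                 # Sensor ID / NAME
    value                : double                       # sensor value
    """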

My plan (sketched in code after the list) is to

  1. Fetch a limited amount of KEYS (i.e. fetch('KEY', limit=some_large_number))
  2. Retrieve table entries for those KEYS
  3. Save data chunk on disk
  4. Delete table entries belonging to KEYS
  5. Start over ...
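
In code, the loop I have in mind looks roughly like this (a sketch only; the table handle and the file path are placeholders):

import numpy as np

chunk_size = 1_000_000
while len(table):
    keys = table.fetch('KEY', limit=chunk_size)   # 1. fetch a limited number of primary keys
    block = (table & keys).fetch()                # 2. retrieve the entries for those keys
    np.save('chunk.npy', block)                   # 3. save the chunk to disk (placeholder path)
    (table & keys).delete()                       # 4. delete the entries belonging to those keys
    # 5. start over with the next chunk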

Just fetching, say, 1_000_000 entries is reasonably fast (a few seconds). However, when I go through the full cycle of retrieving primary keys -> fetching the data for those keys -> deleting those entries, I get stuck at the fetch step (>20 minutes for ~100_000 entries, or a timeout?).

What is the most elegant / time-efficient way to perform this task using DataJoint logic?

(DataJoint Python version 0.13.2)


1 Answer

Answered by Dimitri Yatsenko (accepted answer)

If your data are not changing, then you can use the limit and offset keywords to stride through the table, fetching in chunks without deleting or restricting. There is no need to retrieve the primary key first, to restrict the fetch, or to delete the block.

step = 1_000_000                            # rows per chunk
n_chunks = (len(table) + step - 1) // step  # ceiling division over the row count
for chunk in range(n_chunks):
    # fetch one block; the offset advances by one full chunk each iteration
    block = table.fetch(limit=step, offset=step * chunk)
    ... # save block

If this is a simple backup, then deletes are not necessary. If you need to delete the data, you can delete it all at once at the end.
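
For instance, once every chunk has been written to disk, the whole table can be cleared in a single call (a sketch, assuming the entire table should be emptied; delete() asks for confirmation unless safemode is disabled in dj.config):

# remove all rows in one operation after the backup loop completes
# (assumes the entire table is to be cleared)
table.delete()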