I am trying to back up some data from a table with > 50M rows. The table definition looks like this:
    #
    entry_id : int unsigned auto_increment  # unique entry id
    ---
    time = CURRENT_TIMESTAMP : timestamp    # current timestamp
    room : varchar(255)                     # Room
    id : varchar(255)                       # Sensor ID / NAME
    value : double                          # sensor value
My plan (sketched in code below) is to:

- Fetch a limited number of primary keys, i.e. `fetch('KEY', limit=some_large_number)`
- Retrieve the table entries for those keys
- Save the data chunk to disk
- Delete the table entries belonging to those keys
- Start over ...
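For concreteness, here is a minimal sketch of that loop (the part that turns out to be slow). The table class name `SensorReading`, the chunk size, and the pickle files are placeholders for illustration, not anything prescribed by DataJoint:

```python
# Sketch of the fetch-keys -> fetch-rows -> save -> delete loop described above.
# `SensorReading` stands in for the DataJoint table class declared elsewhere in
# the pipeline; chunk size and file names are arbitrary choices.
CHUNK = 1_000_000
table = SensorReading()
part = 0
while len(table):                                   # stop when the table is empty
    keys = table.fetch('KEY', limit=CHUNK)          # 1) a batch of primary keys
    block = (table & keys).fetch(format='frame')    # 2) rows restricted to those keys
    block.to_pickle(f'backup_part{part:04d}.pkl')   # 3) save the chunk to disk
    (table & keys).delete(safemode=False)           # 4) delete what was just saved
    part += 1
```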
Just fetching, let's say, 1_000_000 entries is reasonably fast (a few seconds). However, when I go through the full cycle of retrieving primary keys -> fetching the data restricted to those keys -> deleting those entries, I am basically stuck at the data-fetching step (> 20 minutes for ~100_000 entries, or a timeout(?)).
What is the most elegant / time-efficient way to perform this task using DataJoint logic?
(DataJoint Python version 0.13.2)
If your data are not changing, then you can use the `limit` and `offset` keyword arguments to stride through the table, fetching it in chunks without restricting or deleting. There is no need to retrieve the primary keys first, to restrict the fetch, or to delete each block. If this is a simple backup, the deletes are not necessary at all; if you do need to delete the data, you can delete it all at once at the end.
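A minimal sketch of that `limit`/`offset` approach, again with `SensorReading`, the chunk size, and the pickle files as placeholder assumptions:

```python
# Stride through the table in fixed-size chunks: no key retrieval, no restriction,
# no per-chunk deletes. `SensorReading` is assumed to be declared elsewhere.
CHUNK = 1_000_000
table = SensorReading()
total = len(table)                                   # total row count

for part, offset in enumerate(range(0, total, CHUNK)):
    block = table.fetch(format='frame', order_by='entry_id',
                        limit=CHUNK, offset=offset)  # one deterministic slice
    block.to_pickle(f'backup_part{part:04d}.pkl')

# If the rows must also be removed, delete them all at once at the end:
# table.delete()
```

Each chunk here is a plain LIMIT ... OFFSET ... query, so nothing has to build or match a restriction from hundreds of thousands of keys, which is presumably what made the key-restricted fetches so slow.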