MemSQL database recovery

1.4k views Asked by At

We have single node MemSQL community edition running in production though MemSQL single node is not recommended for production use , we started as POC and deployed the same to Prod

Today we ran into the below issue ,

(stderr) ERROR 1720 (HY000) at line 3: Leaf Error (10.0.0.28:3307): Memory usage by MemSQL for tables (26255 MB) has reached the value of 'maximum_table_memory' global variable (26064 MB). This query cannot be executed.

And we tried to delete some data and we were not able to do so because of the below exception Error Code: 1712. Leaf Error (10.0.0.28:3307): Not enough memory available to complete the current request. The request was not processed. 46.836 sec

Then we doubled the RAM on the machine and restarted memsql , but leaf didn't picked up the additional memory , then I have changed the memory setting at memsql.cnf ( /var/lib/memsql/leaf-3307/ ) and restarted the leaf node and then the leaf picked up the additional memory

Ideally memory shouldn't be hard coded , instead it should be percentage of memory available over the machine and after restart it should pick the additional memory

When leaf picked up the additional memory , we started encountering different set of issues

We were continuously getting the below 2 exceptions and as and when we are able to connect and restart the application (ETL) , we were again running into the same problem , we tried 10 times and we have no clue what's happening , we tried restarting memsql , we tried rebalancing paritions (we know it doesn't work but still) , we tried repairing database but the DB is in online mode , when the leaf picked up additional memory after bumping the machine config (AWS) , we purged (deleted) some data from one of the table , so we thought this might be creating problem and recreated the table but still no luck

ERROR 1731 (HY000): The database 'reports_and_summary' will be available to query in 35 seconds after recovery from disk is finished. Run SHOW DATABASES EXTENDED and visit http://docs.memsql.com/5.5/concepts/database#states for more information

"Primary key recovery of database memsql"

Finally the only thing which worked is , we upgraded the memsql version and I think it did a clean install again and it started working , but what if no new version is available for upgrade :)

Did any one faced similar issue and what can be the root cause ?

1

There are 1 answers

8
Adam Prout On

1) If you don't explicitly set maximum_memory in your memsql.cnf file then MemSQL will set maximum_memory to 90% of the physical memory on your machine and maximum_table_memory to 80% of the physical memory on your machine. See http://docs.memsql.com/docs/memory-management for more details. So, if you add more memory and restart MemSQL it will pick up and use the new memory as long as its not constrained by a maximum_memory setting in your .cnf file.

2) MemSQL needs to use memory to run DELETE queries (its a multi-versioning database - DELETE queries don't physically delete rows right away, they mark them as deleted. When the DELETE commits the rows can by deleted and memory freed up if no other query is using the rows). If you hit an out-of-memory error running a delete the easiest way out is to delete fewer rows in a single delete statement (i.e, put a LIMIT 10000 on the delete and run several DELETEs instead of one large delete which requires more memory then available). If your ok deleting all data from a table then TRUNCATE TABLE uses much less memory then DELETE. You could also run SET GLOBAL maximum_memory and adjust maximum_memory to a higher value but this isn't recommended.

3) If you hit a "database is recovering" error its because MemSQL has not finished reloading all the data from disk into memory after a restart. If you wait for it to finish recovering then the data will be queryable. MemSQL is a memory optimized database, so all data must be in memory before it will allow queries to run. If you have MemSQL enterprise edition you can run with redundancy 2 and then you won't have to wait for recovery (there will another copy of the data already stored in memory on another leaf node).