How to atomic select rows in Mysql?

330 views Asked by At

I have 5+ simultaneously processes selecting rows from the same mysql table. Each process SELECTS 100 rows, PROCESS IT and DELETES the selected rows.

But I'm getting the same row selected and processed 2 times or more.

How can I avoid it from happening on MYSQL side or Ruby on Rails side?

The app is built on Ruby On Rails...

3

There are 3 answers

0
MarkR On

Eric's answer is good, but I think I should elaborate a little...

You have some additional columns in your table say:

lockhost VARCHAR(60),
lockpid INT,
locktime INT, -- Or your favourite timestamp.

Default them all to NULL.

Then you have the worker processes "claim" the rows by doing:

UPDATE tbl SET lockhost='myhostname', lockpid=12345,
 locktime=UNIX_TIMESTAMP() WHERE lockhost IS NULL ORDER BY id
 LIMIT 100

Then you process the claimed rows with SELECT ... WHERE lockhost='myhostname' and lockpid=12345

After you finish processing a row, you make whatever updates are necessary, and set lockhost, lockpid and locktime back to NULL (or delete it).

This stops the same row being processed by more than one process at once. You need the hostname, because you might have several hosts doing processing.

If a process crashes while it is processing a batch, you can check if the "locktime" column is very old (much older than processing can possibly take, say several hours). Then you can just reclaim some rows which have an old "locktime" even though their lockhost is not null.

This is a pretty common "queue pattern" in databases; it is not extremely efficient. If you have a very high rate of items entering / leaving the queue, consider using a proper queue server instead.

0
chrispanda On
0
Eric On

Your table appears to be a workflow, which means you should have a field indicating the state of the row ("claimed", in your case). The other processes should be selecting for unclaimed rows, which will prevent the processes from stepping on each others' rows.

If you want to take it a step further, you can use process identifiers so that you know what is working on what, and maybe how long is too long to be working, and whether it's finished, etc.

And yeah, go back to your old questions and approve some answers. I saw at least one that you definitely missed.