Worker management for durable queue in Redis


It is a well-known pattern to use LPUSH and BRPOPLPUSH (http://redis.io/commands/rpoplpush) to implement a durable queue in Redis. However, in order to scale out, the design needs to cater for multiple workers/consumers that each BRPOPLPUSH tasks from the main task queue.

So the norm seems to be a separate processing_queue per worker that records the task that worker is currently working on, so that the worker can pick up any unfinished task if it exits during processing.
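For reference, here is a minimal sketch of the pattern in Python, assuming the redis-py client; the key names, the handle() helper, and keying the processing queue by process ID are placeholder choices, not fixed conventions:

```python
import os
import redis

r = redis.Redis()

MAIN_QUEUE = "task_queue"                       # producers LPUSH jobs here
PROCESSING = f"processing_queue:{os.getpid()}"  # this worker's in-flight list

def handle(job):
    ...  # application-specific processing

while True:
    # Atomically move one job from the main queue to this worker's
    # processing queue, blocking until a job is available.
    job = r.brpoplpush(MAIN_QUEUE, PROCESSING, timeout=0)
    handle(job)
    # Acknowledge completion by removing the finished job.
    r.lrem(PROCESSING, 1, job)
```

(On Redis 6.2 and later, BLMOVE supersedes BRPOPLPUSH, but the pattern is identical.)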

I have two questions regarding this processing_queue:

  1. Is it correct to reason that there will be at most one item/task in a worker's processing_queue at any time? I assume a worker starts by checking its own processing_queue for any left-over task before calling BRPOPLPUSH on the main task queue. If so, we could use any of RPOP, LPOP, or LREM to remove the task once the worker finishes processing it (or simply delete the list), and we could even use a set instead of a list. Is there any reason why so many people choose LREM over the alternatives? (See the sketch after this list.)

  2. I have seen many people recommend identifying each processing_queue by the process ID of the corresponding worker. But what happens when the old worker exits and a new one is spawned with a (most likely) different process ID? How does the new worker look up its predecessor's processing_queue to finish a possible left-over task? I plan to use Supervisor to manage my worker processes, if that makes a difference.
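To make question 1 concrete, here is a sketch of the startup order it describes, reusing the placeholders from the snippet above:

```python
def recover(r, processing_queue):
    # On startup, finish whatever a previous run left behind. With at
    # most one in-flight job, the list is empty or a single element.
    for job in r.lrange(processing_queue, 0, -1):
        handle(job)
        r.lrem(processing_queue, 1, job)
    # Only after this does the worker resume the BRPOPLPUSH loop above.
```

As for LREM: one plausible reason it is favored is that it removes exactly the job the worker just finished, so the acknowledgement stays correct even if the processing queue unexpectedly holds more than one item (a possibility the answer below raises).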


1 Answer

Itamar Haber:
  1. That depends. If all you want is for the worker to pull a job from the main queue, process it, rinse, and repeat, then yes. However, this isn't a hard requirement but rather a design choice. For example, you could have the worker push additional items into the processing_queue when you want to serialize jobs derived from the main one.

  2. I'm not familiar with Supervisor, but you can keep a separate data structure - a Sorted Set would be best here - with your workers' identifiers as members and a timestamp as each one's score. Have your workers update their timestamps periodically, and when you launch a new worker, have it check that set for workers that have been silent too long. When one is found, try to ascertain that it is in fact dead, and then have the new worker take over the relevant processing_queue (e.g. with a RENAME), as sketched below.
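A minimal sketch of that heartbeat scheme, again assuming redis-py; the key name workers:heartbeats, the 60-second staleness threshold, and the reuse of handle() from the question's snippet are illustrative assumptions, not part of the answer:

```python
import time
import redis

r = redis.Redis()
HEARTBEATS = "workers:heartbeats"  # sorted set: worker id -> last heartbeat time
STALE_AFTER = 60                   # seconds of silence before a worker is suspect

def heartbeat(worker_id):
    # Each live worker calls this periodically.
    r.zadd(HEARTBEATS, {worker_id: time.time()})

def adopt_stale_queues(my_id):
    # At startup, look for workers whose heartbeat is too old and take
    # over their processing queues. The staleness threshold stands in
    # for the "ascertain it is dead" step; tune it to your deployment.
    cutoff = time.time() - STALE_AFTER
    mine = f"processing_queue:{my_id}"
    for raw in r.zrangebyscore(HEARTBEATS, 0, cutoff):
        stale_id = raw.decode()
        try:
            # RENAME is atomic, so only one new worker wins a takeover;
            # it fails if the source key no longer exists (empty queue).
            r.rename(f"processing_queue:{stale_id}", mine)
        except redis.ResponseError:
            r.zrem(HEARTBEATS, stale_id)
            continue
        # Drain the adopted job(s) before the next takeover, since a
        # second RENAME onto the same key would overwrite this one.
        for job in r.lrange(mine, 0, -1):
            handle(job)
            r.lrem(mine, 1, job)
        r.zrem(HEARTBEATS, stale_id)
```

Note that a worker blocked in BRPOPLPUSH with timeout=0 cannot heartbeat; in practice you would use a finite timeout or send heartbeats from a separate thread.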