How to make Bluepill restart Resque workers only after reaching a safe status


Let's say this is my worker:

class FooWorker
  @queue = :foo

  def self.perform
    User.all.each do |u|
      # ...
      # Do long operations (unsafe to kill)
      # ...

      # Here it's safe to break the worker and restart
    end
  end
end

I'm enqueuing this with Resque Scheduler, and this is my Bluepill configuration:

...
app.process(process_name) do |process|
  process.group         = "resque"
  process.start_command = "rake environment resque:work QUEUE=foo RAILS_ENV=production"
  ...
  process.stop_signals  = [:quit, 5.seconds, :term, 1.minute, :kill]
  process.daemonize     = true

  process.start_grace_time = 30.seconds
  process.stop_grace_time  = 80.seconds

  process.monitor_children do |child_process|
    child_process.stop_command = "kill -QUIT {{PID}}"

    child_process.checks :mem_usage, :every => 30.seconds, :below => 500.megabytes, :times => [3,4], :fires => :stop
  end
end
....

I'd like Bluepill or Resque to wait until the worker reaches the "safe" point before restarting it or shutting it down. How can I achieve this?

Answer by biomancer:

Try it this way:

1) Make Resque stop its children gracefully on TERM/INT (its new_kill_child behavior) by setting the TERM_CHILD and RESQUE_TERM_TIMEOUT environment variables in the start command:

process.start_command = "rake environment resque:work QUEUE=foo RAILS_ENV=production TERM_CHILD=1 RESQUE_TERM_TIMEOUT=20.0"

The default value of RESQUE_TERM_TIMEOUT is 4 seconds.

With these variables set, Resque sends TERM to the child, waits up to RESQUE_TERM_TIMEOUT seconds, and kills the child if it is still running. Be sure to

a) set this timeout long enough for your critical section to finish,

b) make the wait that follows :term in Bluepill's process.stop_signals a bit longer than RESQUE_TERM_TIMEOUT, so that Bluepill does not kill the worker while it is still waiting for the child to leave its critical section.
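
One way to keep the two timeouts from drifting apart is to derive both from a single value. The following is only a sketch: the constant names and the 10-second margin are illustrative assumptions, not part of Resque or Bluepill, and plain numbers stand in for the 5.seconds style durations used in the Bluepill file.

```ruby
# Hypothetical constants for illustration; neither Resque nor Bluepill defines them.
RESQUE_TERM_TIMEOUT = 20.0 # seconds Resque waits for the child after sending TERM
TERM_MARGIN = 10           # assumed extra seconds Bluepill waits beyond that

# Wait after :term in Bluepill, always longer than Resque's own timeout.
bluepill_term_wait = (RESQUE_TERM_TIMEOUT + TERM_MARGIN).ceil

start_command = "rake environment resque:work QUEUE=foo RAILS_ENV=production " \
                "TERM_CHILD=1 RESQUE_TERM_TIMEOUT=#{RESQUE_TERM_TIMEOUT}"

# Same shape as process.stop_signals in the question, with seconds as plain numbers.
stop_signals = [:quit, 5, :term, bluepill_term_wait, :kill]

puts start_command
puts stop_signals.inspect
```

In the Bluepill file itself, these values would be assigned to process.start_command and process.stop_signals, with the waits written as durations as in the question. The whole stop sequence should also fit within process.stop_grace_time.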

2) Handle the TERM signal in the child process so that it stops gracefully:

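class FooWorker
  class << self
    # Set to true by the TERM handler; checked at the safe point of the loop.
    attr_accessor :stop
  end

  @queue = :foo

  def self.perform
    # Install the handler here so that it takes effect in the process that
    # actually runs the job and overrides any TERM handler set up earlier.
    trap('TERM') do
      FooWorker.stop = true
    end

    User.all.each do |u|
      # ...
      # Do long operations (unsafe to kill)
      # ...

      # Here it's safe to break the worker and restart
      return if FooWorker.stop
    end
  end
end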
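
The stop-flag pattern can be tried outside Resque. The sketch below is a standalone simulation, not Resque code: DemoWorker and its items are hypothetical, and the process sends TERM to itself while handling the second item to imitate a signal arriving inside the unsafe section. It installs the trap inside perform so that the handler is set in the process that runs the loop.

```ruby
# Standalone simulation of the stop-flag pattern; Resque is not required.
class DemoWorker
  class << self
    attr_accessor :stop
  end

  # Processes items one by one and returns the items that were fully processed.
  def self.perform(items)
    self.stop = false
    # The handler only sets a flag; the loop decides when it is safe to exit.
    previous = Signal.trap('TERM') { DemoWorker.stop = true }
    completed = []

    items.each do |item|
      # Simulate TERM arriving in the middle of the unsafe section of item 2.
      Process.kill('TERM', Process.pid) if item == 2
      sleep 0.05 # stand-in for the long operation; lets the handler run
      completed << item

      # Safe point: the current item is finished, so stopping here loses nothing.
      break if stop
    end

    completed
  ensure
    # Restore whatever TERM handler was active before.
    Signal.trap('TERM', previous || 'DEFAULT')
  end
end

result = DemoWorker.perform([1, 2, 3, 4])
puts result.inspect # prints [1, 2]
```

Item 2 is still completed even though the signal arrived while it was being processed; the loop exits only at the safe point that follows it.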