How to recover Mesos executor after Mesos framework failure?


My scenario is that a framework is running on server A. It has an executor on server B running a task (a long-running web service with a long initialization time). Server A is shut down. The framework is then restarted somewhere else in the cluster.

Currently, after the restart, the new framework registers a new executor, which runs a new task. After some time, the Mesos master deactivates the old, no-longer-running framework, which in turn kills the old but still-running executor and its task.

I would like the new framework to re-register the old executor rather than register a new one. Is this possible?

Answer by mab (best answer)

This thread on the Mesos user mailing list answers my question:

http://www.mail-archive.com/user%40mesos.apache.org/msg00069.html

Included here for reference:

(1) One thing in particular I found unexpected is that the executors are shut down if the scheduler is shut down. Is there a way to keep executors/tasks running when the scheduler is down? I would imagine that when the scheduler comes back, it could reestablish the state somehow and keep going without interrupting the running tasks. Is this a use case that Mesos is designed for?

You can use FrameworkInfo.failover_timeout to tell Mesos how long to wait for the framework to re-register before it cleans up the framework's executors and tasks.

Also, note that for this to work the framework has to persist its frameworkId when it first registers with the master. When the framework comes back up it needs to reconnect by setting FrameworkInfo.framework_id = persisted id.
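
To make the two steps concrete, here is a minimal Python sketch of the persist-and-reuse pattern. It does not call the real Mesos bindings: `FrameworkInfoSketch` is a hypothetical stand-in dataclass that only mirrors the fields named above, and `load_framework_id`, `save_framework_id`, `build_framework_info`, the framework name, and the one-hour timeout are all assumptions made for illustration. As far as I know, `failover_timeout` is expressed in seconds and the ID field is called `id` in the Mesos protobuf definition, so check `mesos.proto` for your version before copying the field names.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrameworkInfoSketch:
    """Hypothetical stand-in for Mesos FrameworkInfo (not the real protobuf)."""
    name: str
    failover_timeout: float = 0.0       # seconds the master waits for re-registration
    framework_id: Optional[str] = None  # None means "register as a new framework"


def load_framework_id(path: str) -> Optional[str]:
    """Return the previously persisted framework ID, or None if there is none."""
    try:
        with open(path) as f:
            return f.read().strip() or None
    except FileNotFoundError:
        return None


def save_framework_id(path: str, framework_id: str) -> None:
    """Persist the ID assigned by the master; write to a temp file, then rename."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(framework_id)
    os.replace(tmp_path, path)


def build_framework_info(id_path: str,
                         failover_timeout: float = 3600.0) -> FrameworkInfoSketch:
    """Build the info used to (re-)register, reusing a persisted ID if present."""
    return FrameworkInfoSketch(
        name="web-service-framework",  # assumed name, for illustration only
        failover_timeout=failover_timeout,
        framework_id=load_framework_id(id_path),
    )
```

In a real scheduler, `save_framework_id` would be called once the master has assigned an ID (for example from the scheduler's `registered` callback), and `build_framework_info` would run at every start. Since server A is the machine that goes away, the persisted ID has to live somewhere the restarted framework can still reach, such as shared storage or ZooKeeper, rather than on A's local disk. The failover timeout should also be longer than the time it realistically takes to restart the framework elsewhere.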