I'm using Snakemake on a cluster, and I don't know how best to handle the fact that some jobs can be preempted.

To get more computing power on the cluster I use, it is possible to access the resources of other teams, but at the risk of being preempted: the running job is stopped and rescheduled, and it is launched again as soon as a resource is available. This is especially advantageous when you have a lot of quick jobs to run. Unfortunately, I have the impression that Snakemake does not support this properly.

In the example given in the documentation of the cluster-status feature for Slurm, PREEMPTED is missing from the running_status list (running_status=["PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED"]), so a preempted job may be considered failed. Not a big deal, I’ve added PREEMPTED to this list, but it leads me to believe that Snakemake did not consider this scenario.

More annoyingly, even when running Snakemake with the --rerun-incomplete option, when a job is interrupted by preemption and then restarted, I get the following error:

IncompleteFilesException:
The files below seem to be incomplete. If you are sure that certain files are not incomplete, mark them as complete with

    snakemake --cleanup-metadata <filenames>

To re-generate the files rerun your command with the --rerun-incomplete flag.

I would expect the interrupted job to restart from scratch.

For now, the only solution I have found is to stop using other teams' resources to avoid having my jobs preempted, but I am losing computing power.

How do you use Snakemake in a context where your jobs can be preempted? Does anyone see a solution so that I no longer get the IncompleteFilesException?

Thanks in advance

There are 2 answers

dlaehnemann On

Thanks for reporting these. I see two separate issues here:

  1. Handling of the PREEMPTED status returned by Slurm.
  2. The IncompleteFilesException suggesting you use --rerun-incomplete when that is exactly what you are doing.

1. PREEMPTED status handling

I have no experience with Slurm, so I cannot say whether the script example in the docs that you link to will work for it. In particular, the expression in output = str(subprocess.check_output(expression)) might have to be adjusted for Slurm in some way. Maybe someone here also uses Slurm and has found a working solution in the past?

Otherwise, adding PREEMPTED to the running_status list should be exactly what you want (assuming that this is exactly the tag returned by expression).
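
For illustration, here is a minimal, untested sketch of what such a status script could look like with PREEMPTED added to the running states. The sacct invocation, the helper names classify_state and query_state, and the handling of empty output are assumptions, not something taken from the Snakemake docs. It also assumes that the cluster requeues preempted jobs; if it cancels them instead, treating PREEMPTED as running would make Snakemake wait forever.

```python
#!/usr/bin/env python3
import subprocess
import sys

# States treated as "still running". PREEMPTED is the addition discussed
# above; the other entries are the ones listed in the question.
RUNNING_STATUS = {
    "PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED",
    "PREEMPTED",
}


def classify_state(sacct_output):
    """Map raw `sacct` State output to success / running / failed."""
    lines = [line.strip() for line in sacct_output.splitlines() if line.strip()]
    if not lines:
        # Assumption: no accounting record yet means the job is still queued.
        return "running"
    # Keep only the first word of the first record,
    # e.g. "CANCELLED by 1234" becomes "CANCELLED".
    state = lines[0].split()[0].rstrip("+")
    if state == "COMPLETED":
        return "success"
    if state in RUNNING_STATUS:
        return "running"
    return "failed"


def query_state(jobid):
    """Ask Slurm accounting for the job state (assumed sacct invocation)."""
    return subprocess.check_output(
        ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
        text=True,
    )


if __name__ == "__main__" and len(sys.argv) == 2:
    # Snakemake calls the status script with the job id as its only argument.
    print(classify_state(query_state(sys.argv[1])))
```

Saved as an executable status.py, a script like this would be passed to Snakemake through the --cluster-status option. Keeping the classification in its own function makes it possible to check the state mapping without access to a cluster.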

If this has to be adapted to Slurm and you manage to write a working status.py script, it might be worth adding it to the docs via a pull request against this file, so that other Slurm users don't have to reinvent the solution.

2. IncompleteFilesException with --rerun-incomplete flag

From the general description, this sounds like a bug, but without more detail I cannot be sure. It may be worth filing an issue in the Snakemake repo that describes the behavior in more detail, ideally with a minimal example that reproduces it.

Johannes Köster On

Snakemake has a restart feature that can be used to resubmit jobs automatically. However, there is indeed currently no special handling for preemption. You are also right that this scenario was not considered: I was not even aware that something like this exists on Slurm. A PR in that direction would of course be welcome. Basically, one would need to extend the status script handling to recognize preemption and restart the job in that case.