I'm using Snakemake on a cluster, and I don't know how best to handle the fact that some jobs can be preempted.
On the cluster I use, it is possible to get more computing power by borrowing other teams' resources, at the risk of being preempted: the running job is stopped and rescheduled, then launched again as soon as a resource is available. This is especially advantageous when you have a lot of quick jobs to run. Unfortunately, I don't get the impression that Snakemake handles this properly.
In the example given in the help on the cluster-status feature for Slurm, there is no PREEMPTED in the running_status list (running_status=["PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED"]), which may lead Snakemake to consider that a preempted job has failed. Not a big deal, I've added PREEMPTED to this list, but it makes me think that Snakemake did not consider this scenario.
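For reference, the modified status script could look roughly like the sketch below. It follows the shape of the documented example (print success, running or failed for a given job ID) with PREEMPTED added to the running states. The function names classify and query_state are made up for illustration, and the sacct call is an assumption about how the state is queried on a given cluster. Which state a preempted job actually reports may depend on how preemption is configured there.

```python
#!/usr/bin/env python3
"""Sketch of a Slurm status script that treats PREEMPTED as still running."""
import subprocess
import sys

# States for which Snakemake should keep waiting.
# PREEMPTED is the only addition to the list from the documented example.
RUNNING_STATES = (
    "PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED", "PREEMPTED",
)


def classify(state):
    """Map a raw Slurm state string to 'success', 'running' or 'failed'."""
    state = state.strip().upper()
    if state.startswith("COMPLETED"):
        return "success"
    if any(state.startswith(s) for s in RUNNING_STATES):
        return "running"
    # Anything else (FAILED, CANCELLED, TIMEOUT, empty output, ...) counts as failed.
    return "failed"


def query_state(jobid):
    """Return the state of the first sacct record for jobid (assumes sacct is available)."""
    out = subprocess.check_output(
        ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
        text=True,
    )
    lines = [line for line in out.splitlines() if line.strip()]
    return lines[0] if lines else ""


if __name__ == "__main__" and len(sys.argv) > 1:
    # Snakemake passes the job ID as the first argument.
    print(classify(query_state(sys.argv[1])))
```

Keeping the classification in its own function makes it possible to check the mapping without access to a cluster, for example classify("PREEMPTED") returns "running".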
More annoyingly, even when running Snakemake with the --rerun-incomplete option, when the job is interrupted by the preemption, then restarted, I get the following error:
IncompleteFilesException:
The files below seem to be incomplete. If you are sure that certain files are not incomplete, mark them as complete with
snakemake --cleanup-metadata <filenames>
To re-generate the files rerun your command with the --rerun-incomplete flag.
I would expect the interrupted job to restart from scratch.
For now, the only solution I have found is to stop using other teams' resources to avoid having my jobs preempted, but I am losing computing power.
How do you use Snakemake in a context where your jobs can be preempted? Does anyone see a solution so that I don't get the IncompleteFilesException anymore?
Thanks in advance
Thanks for reporting these, I see two separate issues here:
- PREEMPTED status returned by slurm.
- IncompleteFilesException suggesting you use --rerun-incomplete when that is exactly what you are doing.
1. PREEMPTED status handling
I have no experience in using slurm, so I cannot comment if the script example in the docs that you are linking to will work for slurm. Especially the expression in
output = str(subprocess.check_output(expression))
might have to be adjusted to slurm in some way. Maybe there's someone around here who also uses slurm and has found a working solution in the past?
But otherwise, adding PREEMPTED to the running_status list should be exactly what you want to do (assuming that that is exactly the tag returned by expression).
If this has to be adapted to slurm and you manage to generate a working status.py script, it might be worth adding this to the docs via a pull request onto this file, so that other slurm users don't have to reinvent the solution.
2. IncompleteFilesException with --rerun-incomplete flag
From the general description, this sounds a bit like a bug. But without any details, I cannot be sure. But maybe it's worth describing this in some more detail while filing an issue in the snakemake repo. Either simply by providing more details, or by even providing a minimal example to reproduce this behavior.