Can SnakeMake be forced to rerun rules when files are missing?


When a file that was made earlier in the pipeline is removed, SnakeMake does not seem to consider that a problem, as long as later files are there:

rule All:
    input: "testC1.txt", "testC2.txt"

rule A:
    input: "{X}{Y}.txt"
    output: "{X}A{Y}.txt"
    shell: "cp {input} {output}"

rule B:
    input: "{X}A{Y}.txt"
    output: "{X}B{Y}.txt"
    shell: "cp {input} {output}"

rule C:
    input: "{X}B{Y}.txt"
    output: "{X}C{Y}.txt"
    shell: "cp {input} {output}"

Save this SnakeFile in test.sf and do this:

rm testA*.txt testB*.txt testC*.txt
echo "test1" >test1.txt
echo "test2" >test2.txt
snakemake -s test.sf
# Rerun:
snakemake -s test.sf
# SnakeMake says all is up to date, which it is.
# Remove intermediate results:
rm testA1.txt
# Rerun:
snakemake -s test.sf

SnakeMake says all is up to date. It does not detect missing testA1.txt.

I seem to recall something in the online SnakeMake manual about this, but I can no longer find it.

I assume this is expected SnakeMake behavior. It can sometimes be desired behavior, but sometimes you may want it to detect and rebuild the missing file. How can this be done?

There are 3 answers

Jon Chung On

I found this thread a while ago about the --forcerun/-R parameter that might be informative.

Ultimately, unless you add a separate rule for that intermediate file or include it as a target in all, regenerating it means forcing snakemake to execute the entire pipeline.

Sebastian Müller On

Indeed, it would be nice if snakemake had a flag that looked for missing intermediate results and regenerated them (and all their dependencies). I'm not aware of such an option, but there are some workarounds. Note that the -R option suggested by m00am and Jon Chung will regenerate all other files regardless of whether intermediate files are missing or not, so it is not ideal.

Workaround 1: Force recreation of file

Force recreation of the intermediate file using the -R or -f flag (help copied below). The key here is to explicitly target the file rather than the rule.

snakemake -s test.sf testA1.txt # only works if testA1.txt was deleted
# or
snakemake -s test.sf -R testA1.txt # testA1.txt can be present or absent
# or
snakemake -s test.sf -f testA1.txt
# or
snakemake -s test.sf -F testA1.txt

Note that for the latter two, the pipeline needs to be run again to update the dependent files:

snakemake -s test.sf 
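Targeting the file explicitly presupposes that you know which intermediate file is missing. A minimal Python sketch of that check follows. It assumes the file naming from the question's test.sf, and the helper names `missing_intermediates` and `rebuild_command` are hypothetical, not part of snakemake. It only builds the command line; it does not run snakemake.

```python
from pathlib import Path


def missing_intermediates(workdir, samples=("1", "2"), stages=("A", "B")):
    """Return expected intermediate files (e.g. testA1.txt) that do not exist.

    The naming scheme test{stage}{sample}.txt is an assumption taken from
    the example workflow in the question.
    """
    workdir = Path(workdir)
    expected = [f"test{stage}{sample}.txt" for stage in stages for sample in samples]
    return [name for name in expected if not (workdir / name).exists()]


def rebuild_command(missing, snakefile="test.sf"):
    """Build (but do not run) a snakemake call that targets the missing files."""
    if not missing:
        return []
    return ["snakemake", "-s", snakefile, *missing]
```

For example, `rebuild_command(missing_intermediates("."))` yields an argument list that could be passed to `subprocess.run`. As noted above, a plain `snakemake -s test.sf` run is still needed afterwards to update the dependent files.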

Prevent update of dependent files (by touching files)

If you don't want the dependent files (i.e. testB1.txt, testC1.txt) to be updated there are also options.

You could regenerate testA1.txt and then "reset" its modification time, e.g. to that of the source file, which will prevent the pipeline from updating anything:

snakemake -s test.sf -f testA1.txt
touch -r test1.txt testA1.txt

snakemake -s test.sf now won't do anything since testB1.txt is newer than testA1.txt

Or you could mark the dependent files (i.e. testB1.txt, testC1.txt) as "newer" using --touch:

snakemake -s test.sf -f testA1.txt
snakemake -s test.sf --touch
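The touch -r step above copies the timestamps of a reference file onto the regenerated file. If you would rather do this from Python, the following is a rough equivalent using only the standard library. The function name `copy_mtime` is made up for this sketch.

```python
import os


def copy_mtime(reference, target):
    """Give `target` the access and modification times of `reference`.

    Roughly what `touch -r reference target` does for an existing target.
    """
    ref = os.stat(reference)
    # ns= takes (atime_ns, mtime_ns) and avoids float rounding of timestamps.
    os.utime(target, ns=(ref.st_atime_ns, ref.st_mtime_ns))
```

Calling `copy_mtime("test1.txt", "testA1.txt")` after regenerating testA1.txt should leave it looking as old as the source file.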

Workaround 2: Creating a new rule

The snakefile could be extended by a new rule:

rule A_all:
    input: "testA1.txt", "testA2.txt"

Which could then be called like so:

snakemake A_all -s test.sf

This will only generate testA1.txt, similar to -f in the workaround above, so the pipeline needs to be rerun or the modification time needs to be changed.

A trick might be to "update" an intermediate file using --touch:

snakemake -s test.sf --touch testA1.txt -n  # -n is a dry run; drop it to actually touch the file

This will "update" testA1.txt. To recreate the dependent files snakemake needs to be run as normal afterwards:

snakemake -s test.sf

Note that this will not work if testA1.txt was already deleted; it has to be done instead of deleting the file.

Relevant help on used parameters:

  --touch, -t           Touch output files (mark them up to date without
                        really changing them) instead of running their
                        commands. This is used to pretend that the rules were
                        executed, in order to fool future invocations of
                        snakemake. Fails if a file does not yet exist.

  --force, -f           Force the execution of the selected target or the
                        first rule regardless of already created output.
  --forceall, -F        Force the execution of the selected (or the first)
                        rule and all rules it is dependent on regardless of
                        already created output.
  --forcerun [TARGET [TARGET ...]], -R [TARGET [TARGET ...]]
                        Force the re-execution or creation of the given rules
                        or files. Use this option if you changed a rule and
                        want to have all its output in your workflow updated.

m00am On

As mentioned in this other answer, the -R parameter can help, but there are more options:

Force a rebuild of the whole workflow

When you call

snakemake -F

this will trigger a rebuild of the whole pipeline. This basically means: forget all intermediate files and start anew. This will definitely (re)generate all intermediate files on the way. The downside is that it might take some time.

Force a specific rule

This is the realm of the -R <rule> parameter. This re-runs the given rule and all rules that depend on it. So in your case

snakemake -R A -s test.sf

would rerun rule A (to build testA1.txt from test1.txt) and the rules B, C and All, since they depend on A. Mind that this runs all instances of rule A that are required, so in your example testA2.txt and everything that follows from it is also rebuilt.

If, in your example, you had removed testB1.txt instead, only the rules B and C would have been rerun.
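The "rule plus everything that depends on it" behaviour described above can be pictured as a walk over the rule graph. The sketch below is an illustration only, not snakemake's implementation. The `DOWNSTREAM` mapping is a hand-written, hypothetical encoding of the question's workflow.

```python
from collections import deque

# Hypothetical encoding of the example workflow:
# rule name -> rules that consume its output.
DOWNSTREAM = {"A": ["B"], "B": ["C"], "C": ["All"], "All": []}


def rules_to_rerun(forced, downstream=DOWNSTREAM):
    """Return `forced` plus every rule reachable from it, breadth first."""
    seen = [forced]
    queue = deque([forced])
    while queue:
        for nxt in downstream[queue.popleft()]:
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen
```

Under this model, forcing A pulls in B, C and All, while forcing B leaves A untouched.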

Why does this happen?

If I remember correctly, snakemake decides whether a file needs to be rebuilt by its modification time (mtime). So if you have a version of testA1.txt that is younger (as in more recently modified) than testB1.txt, testB1.txt has to be rebuilt using rule B to ensure everything is up to date. Hence, you cannot easily rebuild only testA1.txt without also rebuilding all the files that follow, unless you somehow change the files' modification times.
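The timestamp comparison described above can be modelled in a few lines of Python. This is a deliberately simplified sketch and an assumption about the general idea, not snakemake's actual logic. The function name `needs_rebuild` is hypothetical. Note that a missing input is simply ignored here, which mirrors the behaviour observed in the question.

```python
import os


def needs_rebuild(output, inputs):
    """Simplified model: rebuild if `output` is missing or older than an input.

    Inputs that do not exist are skipped, so a deleted intermediate file
    does not by itself make its dependents look out of date.
    """
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(
        os.path.exists(path) and os.path.getmtime(path) > out_mtime
        for path in inputs
    )
```

With testA1.txt deleted, `needs_rebuild("testB1.txt", ["testA1.txt"])` is false in this model. Once testA1.txt is recreated with a newer timestamp, it becomes true, which is why the rebuild propagates downstream.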

I have not tried this out, but it should be possible with snakemake's --touch parameter. If you manage to run only rule A and then run snakemake -R B -t, which touches all output files of rule B and the rules that follow, you could get a valid workflow state without actually rerunning all the steps in between.