Generate many files with wildcard, then merge into one


I have two rules in my Snakefile: one generates several sets of files using a wildcard, and the other merges them all into a single file. This is how I wrote it:

chr = range(1,23)

rule generate:
    input:
        og_files = config["tmp"] + '/chr{chr}.bgen',
    output:
        out = multiext(config["tmp"] + '/plink/chr{{chr}}',
                       '.bed', '.bim', '.fam')
    shell:
        """
        plink \
        --bgen {input.og_files} \
        --make-bed \
        --oxford-single-chr \
        --out {config[tmp]}/plink/chr{chr}
        """
rule merge:
    input:
        plink_chr = expand(config["tmp"] + '/plink/chr{chr}.{ext}',
                           chr = chr,
                           ext = ['bed', 'bim', 'fam'])
    output:
        out = multiext(config["tmp"] + '/all',
                       '.bed', '.bim', '.fam')
    shell:
        """
        plink \
        --pmerge-list-dir {config[tmp]}/plink \
        --make-bed \
        --out {config[tmp]}/all
        """

Unfortunately, Snakemake does not connect the files produced by the first rule to the inputs of the second rule:

$ snakemake -s myfile.smk -c1 -np
Building DAG of jobs...
MissingInputException in line 17 of myfile.smk:
Missing input files for rule merge:
[list of all the files made by expand()]

How can I generate the 22 sets of files with the wildcard chr in generate and still have them tracked as inputs of merge? Thank you in advance for your help.

dariober On BEST ANSWER

In rule generate, I don't think you want to escape the {chr} wildcard: with double braces it is not treated as a normal wildcard, so it never gets replaced. That is:

        out = multiext(config["tmp"] + '/plink/chr{{chr}}',
                       '.bed', '.bim', '.fam')

should be:

        out = multiext(config["tmp"] + '/plink/chr{chr}',
                       '.bed', '.bim', '.fam')
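
To see why the doubled braces matter, here is a minimal sketch in plain Python. It assumes that multiext() simply concatenates the prefix with each extension and does no string formatting, whereas expand() does run a formatting pass. multiext_sketch and the value of tmp are hypothetical stand-ins used only for illustration; they are not Snakemake's own code.

```python
def multiext_sketch(prefix, *extensions):
    """Stand-in for multiext(): plain concatenation, no str.format() pass."""
    return [prefix + ext for ext in extensions]


tmp = "tmp"  # hypothetical value for config["tmp"]

# Pattern from the question (doubled braces) and from the fix (single braces)
escaped = multiext_sketch(tmp + "/plink/chr{{chr}}", ".bed", ".bim", ".fam")
plain = multiext_sketch(tmp + "/plink/chr{chr}", ".bed", ".bim", ".fam")

print(escaped[0])  # tmp/plink/chr{{chr}}.bed  (braces are still doubled)
print(plain[0])    # tmp/plink/chr{chr}.bed    (an ordinary {chr} wildcard)

# Doubling is only useful where a formatting pass runs, as in expand():
# one pass of str.format() turns {{chr}} back into {chr}.
print("chr{{chr}}.{ext}".format(ext="bed"))  # chr{chr}.bed
```

With the single-brace pattern, the outputs of generate have the same shape as the paths that merge requests through expand(), so the two rules can be linked. Separately, and as an observation about the question's shell block rather than something this answer covers: inside shell, a wildcard is normally referenced as {wildcards.chr}, so the bare {chr} after --out may also need changing, since it would likely resolve to the global chr = range(1,23).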