Accepting slightly different inputs to snakemake rule (.fq vs .fq.gz)

785 views Asked by At

I am new to snakemake and would like to be able to take either a pair of .fq files or a pair of .fq.gz files and run them through trim_galore to get a pair of trimmed .fq.gz output files. Without giving all of my Snakefile, I have the below ugly solution where I just copied the rule and changed the inputs. What would be a better solution?

#Trim galore paired end trimming rule for unzipped fastqs:
rule trim_galore_unzipped_PE:
    input:
        r1=join(config['fq_in_path'], '{sample}1.fq'),
        r2=join(config['fq_in_path'], '{sample}2.fq'),
    output:
        r1=join(config['trim_out_path'], '{sample}1_val_1.fq.gz'),
        r2=join(config['trim_out_path'], '{sample}2_val_2.fq.gz'),
    params:
        out_path=config['trim_out_path'],
    conda:
        'envs/biotools.yaml',
    shell:
        'trim_galore --gzip -o {params.out_path} --paired {input.r1} {input.r2}'

#Trim galore paired end trimming rule for gzipped fastqs:
rule trim_galore_zipped_PE:
    input:
        r1=join(config['fq_in_path'], '{sample}1.fq.gz'),
        r2=join(config['fq_in_path'], '{sample}2.fq.gz'),
    output:
        r1=join(config['trim_out_path'], '{sample}1_val_1.fq.gz'),
        r2=join(config['trim_out_path'], '{sample}2_val_2.fq.gz'),
    params:
        out_path=config['trim_out_path'],
    conda:
        'envs/biotools.yaml',
    shell: 
        'trim_galore --gzip -o {params.out_path} --paired {input.r1} {input.r2}'
1

There are 1 answers

0
TBoyarski On BEST ANSWER

Using input functions is likely the best solution, being as follows:

  1. Pass wildcard to input function
  2. Using known YAML values, build the theoretical file names using that sample name.
  3. Use python functions to check which file (file suffixes technically) are valid
  4. Build a list of valid files
  5. Return and unpack the list of valid files.

Notes:

  • Input and output should have the same wildcards, if they don't it causes issues
  • In the input function, make sure it cannot return a null string, as Snakemake interprets this as a "lack of input" requirement, which is not what you want.
  • If you adopt these suggestions, update the rule name, I forgot to.

Snakefile:

 configfile: "config.yaml"

 from os.path import join
 from os.path import exists

 rule all:
     input:
         expand("{trim_out_path}/{sample}.{readDirection}.fq.gz",
             trim_out_path=config["trim_out_path"],
             sample=config["sampleList"],
             readDirection=['1','2'])


 def trim_galore_input_determination(wildcards):
     potential_file_path_list = []
     # Cycle through both suffix possibilities:
     for fastqSuffix in [".fq", ".fq.gz"]:

         # Cycle through both read directions
         for readDirection in ['.1','.2']:

             #Build the list for ech suffix
             potential_file_path = config["fq_in_path"] + "/" + wildcards.sample + readDirection + fastqSuffix

             #Check if this file actually exists
             if exists(potential_file_path):

                 #If file is legit, add to list of acceptable files
                 potential_file_path_list.append(potential_file_path)

     # Checking for an empty list
     if len(potential_file_path_list):
         return potential_file_path_list
     else:
         return ["trim_galore_input_determination_FAILURE" + wildcards.sample]

 rule trim_galore_unzipped_PE:
     input:
         unpack(trim_galore_input_determination)
     output:
         expand("{trim_out_path}/{{sample}}.{readDirection}.fq.gz",
             trim_out_path=config["trim_out_path"],
             readDirection=['1','2'])
     params:
         out_path=config['trim_out_path'],
     conda:
         'envs/biotools.yaml',
     shell:
         'trim_galore --gzip -o {params.out_path} --paired {input}'

config.yaml:

fq_in_path: input/fq
trim_out_path: output
sampleList: ["mySample1", "mySample2"]

$tree:

|-- [tboyarsk      1540 Sep  6 15:17]  Snakefile
|-- [tboyarsk        82 Sep  6 15:17]  config.yaml
|-- [tboyarsk       512 Sep  6  8:55]  input
|   |-- [tboyarsk       512 Sep  6  8:33]  fq
|   |   |-- [tboyarsk         0 Sep  6  7:50]  mySample1.1.fq
|   |   |-- [tboyarsk         0 Sep  6  8:24]  mySample1.2.fq
|   |   |-- [tboyarsk         0 Sep  6  7:50]  mySample2.1.fq
|   |   `-- [tboyarsk         0 Sep  6  8:24]  mySample2.2.fq
|   `-- [tboyarsk       512 Sep  6  8:55]  fqgz
|       |-- [tboyarsk         0 Sep  6  7:50]  mySample1.1.fq.gz
|       |-- [tboyarsk         0 Sep  6  8:32]  mySample1.2.fq.gz
|       |-- [tboyarsk         0 Sep  6  8:33]  mySample2.1.fq.gz
|       `-- [tboyarsk         0 Sep  6  8:32]  mySample2.2.fq.gz
`-- [tboyarsk       512 Sep  6  7:55]  output

$snakemake -dry (input: fg)

 rule trim_galore_unzipped_PE:
     input: input/fq/mySample1.1.fq, input/fq/mySample1.2.fq
     output: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz
     jobid: 1
     wildcards: sample=mySample1


 rule trim_galore_unzipped_PE:
     input: input/fq/mySample2.1.fq, input/fq/mySample2.2.fq
     output: output/mySample2.1.fq.gz, output/mySample2.2.fq.gz
     jobid: 2
     wildcards: sample=mySample2


 localrule all:
     input: output/mySample1.1.fq.gz, output/mySample2.1.fq.gz, output/mySample1.2.fq.gz, output/   mySample2.2.fq.gz
     jobid: 0

 Job counts:
         count   jobs
         1       all
         2       trim_galore_unzipped_PE
         3

$snakemake -dry (input: fgqz)

 rule trim_galore_unzipped_PE:
     input: input/fqgz/mySample1.1.fq.gz, input/fqgz/mySample1.2.fq.gz
     output: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz
     jobid: 1
     wildcards: sample=mySample1


 rule trim_galore_unzipped_PE:
     input: input/fqgz/mySample2.1.fq.gz, input/fqgz/mySample2.2.fq.gz
     output: output/mySample2.1.fq.gz, output/mySample2.2.fq.gz
     jobid: 2
     wildcards: sample=mySample2


 localrule all:
     input: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz, output/mySample2.1.fq.gz, output/   mySample2.2.fq.gz
     jobid: 0

 Job counts:
         count   jobs
         1       all
         2       trim_galore_unzipped_PE
         3

There are ways to make it more generic, but since you declare and use the YAML config to build most of the file name, I will avoid discussing it in the answer. Just saying it's possible and somewhat encouraged.

The "--paired {input}" will expand to provide both files. Because of the for-loop, the 1 will always come before the 2.