I have an array of structured data similar to:
    - name: foobar
      sex: male
      fastqs:
        - r1: /path/to/foobar_R1.fastq.gz
          r2: /path/to/foobar_R2.fastq.gz
        - r1: /path/to/more/foobar_R1.fastq.gz
          r2: /path/to/more/foobar_R2.fastq.gz
    - name: bazquux
      sex: female
      fastqs:
        - r1: /path/to/bazquux_R1.fastq.gz
          r2: /path/to/bazquux_R2.fastq.gz
Note that fastqs come in pairs, and the number of pairs per "sample" may be variable.
I want to write a process in Nextflow that processes one sample at a time.
For the Nextflow executor to marshal the files properly, they must somehow be typed as path (or file). Thus typed, the executor will copy the files to the compute node for processing. Typing the file paths as val instead treats them as plain strings, and no files will be copied.
A trivial example of a path input from the docs:
    process foo {
      input:
      path x from '/some/data/file.txt'

      """
      your_command --in $x
      """
    }
How should I go about declaring the process input so that the files are properly marshaled to the compute node? So far I haven't found any examples in the docs for how to handle structured inputs.
Your structured data looks a lot like YAML. If you can include a top-level object so that your file looks something like this:
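For instance, the list from the question could be wrapped as follows. The top-level key name `samples` is an assumption made for illustration; any name works as long as the workflow reads the same key.

```yaml
# Hypothetical params file, e.g. saved as params.yaml
samples:
  - name: foobar
    sex: male
    fastqs:
      - r1: /path/to/foobar_R1.fastq.gz
        r2: /path/to/foobar_R2.fastq.gz
      - r1: /path/to/more/foobar_R1.fastq.gz
        r2: /path/to/more/foobar_R2.fastq.gz
  - name: bazquux
    sex: female
    fastqs:
      - r1: /path/to/bazquux_R1.fastq.gz
        r2: /path/to/bazquux_R2.fastq.gz
```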
Then, we can use Nextflow's -params-file option to load the params when we run our workflow. We can access the top-level object from the params, which gives us a list that we can use to create a Channel using the fromList factory method. The following example uses the new DSL 2:
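A minimal sketch of the -params-file plus fromList approach, under some assumptions: the params file exposes the list under a top-level key called `samples`, the process name `show_read_pair` is made up for illustration, and the FASTQ paths actually exist on disk. It targets a recent Nextflow release where DSL 2 is the default and the `debug` directive is available. Note that this sketch launches one task per read pair rather than one per sample.

```nextflow
// Older releases may need DSL 2 switched on explicitly:
// nextflow.enable.dsl = 2

process show_read_pair {
    tag { sample_name }
    debug true   // print the task's stdout (older releases used `echo true`)

    input:
    // `reads` holds the R1/R2 files of one pair, so both get staged into the task directory
    tuple val(sample_name), val(sex), path(reads)

    script:
    """
    echo "${sample_name},${sex}:"
    ls -1 *.fastq.gz
    """
}

workflow {
    // `params.samples` is the (assumed) top-level list loaded via -params-file
    pairs_ch = Channel
        .fromList( params.samples )
        .flatMap { rec ->
            // emit one tuple per read pair, converting the strings to file objects
            rec.fastqs.collect { rg ->
                tuple( rec.name, rec.sex, [ file(rg.r1), file(rg.r2) ] )
            }
        }

    show_read_pair( pairs_ch )
}
```

With the script saved as `main.nf` and the params as `params.yaml` (both file names are placeholders), it could be launched with `nextflow run main.nf -params-file params.yaml`.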
As requested, here's a solution that runs per sample. The problem we have is that we cannot simply feed in a list of lists using the path qualifier (since an ArrayList is not a valid path value). We could flatten() the list of file pairs, but this makes it difficult to access each of the file pairs if we need them. You may not necessarily need the file pair relationship but assuming you do, I think the right solution is to feed the R1 and R2 files in separately (i.e. using a path qualifier for R1 and another path qualifier for R2). The following example introspects the instance type to (re-)create the list of readgroups. We can use the stageAs option to localize the files into progressively indexed subdirectories, since some files in the YAML have identical names.
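One way the per-sample idea could look is sketched below. It rests on several assumptions: the params file exposes the list under a top-level key called `samples`, the process name `show_sample` and the `rg` directory prefix are invented for illustration, and the script echoes the staged pairs where a real pipeline would call its tool. It also assumes a recent Nextflow release (DSL 2 by default, `debug` directive available).

```nextflow
process show_sample {
    tag { sample_name }
    debug true   // print the task's stdout

    input:
    // R1 and R2 files are passed separately; 'rg*/*' stages each file under an
    // indexed subdirectory (rg1/, rg2/, ...) so identical file names do not clash
    tuple val(sample_name), val(sex), path(r1, stageAs: 'rg*/*'), path(r2, stageAs: 'rg*/*')

    script:
    // A single staged file is bound as a Path, several files as a List,
    // so normalise both inputs to lists before pairing them up again
    def r1_files = r1 instanceof List ? r1 : [ r1 ]
    def r2_files = r2 instanceof List ? r2 : [ r2 ]
    def read_pairs = [ r1_files, r2_files ]
        .transpose()
        .collect { fq1, fq2 -> "${fq1},${fq2}" }
    """
    echo "${sample_name},${sex}:"
    echo ${read_pairs.join(' ')}
    """
}

workflow {
    // One channel item per sample: name, sex, all R1 files, all R2 files
    samples_ch = Channel
        .fromList( params.samples )
        .map { rec ->
            def r1 = rec.fastqs.collect { rg -> file(rg.r1) }
            def r2 = rec.fastqs.collect { rg -> file(rg.r2) }
            tuple( rec.name, rec.sex, r1, r2 )
        }

    show_sample( samples_ch )
}
```

Keeping R1 and R2 as two parallel lists preserves the pairing by position, which is what `transpose()` relies on when it rebuilds the read pairs inside the task.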