I have an array of structure data similar to:
- name: foobar
sex: male
fastqs:
- r1: /path/to/foobar_R1.fastq.gz
r2: /path/to/foobar_R2.fastq.gz
- r1: /path/to/more/foobar_R1.fastq.gz
r2: /path/to/more/foobar_R2.fastq.gz
- name: bazquux
sex: female
fastqs:
- r1: /path/to/bazquux_R1.fastq.gz
r2: /path/to/bazquux_R2.fastq.gz
Note that fastqs come in pairs, and the number of pairs per "sample" may be variable.
I want to write a process
in nextflow that processes one sample at a time.
In order for the nextflow executor to properly marshal the files, they must somehow be typed as path
(or file
). Thus typed, the executor will copy the files to the compute node for processing. Simply typing the files paths as var
will treat the paths as strings and no files will be copied.
A trivial example of a path
input from the docs:
process foo {
input:
path x from '/some/data/file.txt'
"""
your_command --in $x
"""
}
How should I go about declaring the process
input so that the files are properly marshaled to the compute node? So far I haven't found any examples in the docs for how to handle structured inputs.
Your structured data looks a lot like YAML. If you can include a top-level object so that your file looks something like this:
Then, we can use Nextflow's
-params-file
option to load the params when we run our workflow. We can access the top-level object from the params, which gives us a list that we can use to create a Channel using thefromList
factory method. The following example uses the new DSL 2:Results:
As requested, here's a solution that runs per sample. The problem we have is that we cannot simply feed in a list of lists using the
path
qualifier (since an ArrayList is not a valid path value). We could flatten() the list of file pairs, but this makes it difficult to access each of the file pairs if we need them. You may not necessarily need the file pair relationship but assuming you do, I think the right solution is to feed the R1 and R2 files in separately (i.e. using a path qualifier for R1 and another path qualifier for R2). The following example introspects the instance type to (re-)create the list of readgroups. We can use thestageAs
option to localize the files into progressively indexed subdirectories, since some files in the YAML have identical names.Results: