Nextflow: structured inputs with files

I have an array of structured data similar to:

- name: foobar
  sex: male
  fastqs:
  - r1: /path/to/foobar_R1.fastq.gz
    r2: /path/to/foobar_R2.fastq.gz
  - r1: /path/to/more/foobar_R1.fastq.gz
    r2: /path/to/more/foobar_R2.fastq.gz
- name: bazquux
  sex: female
  fastqs:
  - r1: /path/to/bazquux_R1.fastq.gz
    r2: /path/to/bazquux_R2.fastq.gz

Note that fastqs come in pairs, and the number of pairs per "sample" may vary.

I want to write a process in Nextflow that processes one sample at a time.

In order for the Nextflow executor to properly marshal the files, they must somehow be typed as path (or file). Thus typed, the executor will stage the files on the compute node for processing. Simply typing the file paths as val will treat the paths as strings, and no files will be staged.

A trivial example of a path input from the docs:

process foo {
  input:
    path x from '/some/data/file.txt'
  """
    your_command --in $x
  """
}

How should I go about declaring the process input so that the files are properly marshaled to the compute node? So far I haven't found any examples in the docs for how to handle structured inputs.

There is 1 answer

Steve

Your structured data looks a lot like YAML. If you can include a top-level object so that your file looks something like this:

samples:
- name: foobar
  sex: male
  fastqs:
  - r1: ./path/to/foobar_R1.fastq.gz
    r2: ./path/to/foobar_R2.fastq.gz
  - r1: ./path/to/more/foobar_R1.fastq.gz
    r2: ./path/to/more/foobar_R2.fastq.gz
- name: bazquux
  sex: female
  fastqs:
  - r1: ./path/to/bazquux_R1.fastq.gz
    r2: ./path/to/bazquux_R2.fastq.gz

Then we can use Nextflow's -params-file option to load the parameters when we run the workflow. The top-level object is then available as params.samples, a list that we can turn into a Channel using the fromList factory method. The following example uses DSL 2 and emits one tuple per read pair, so the process runs once per pair rather than once per sample:

process test_proc {

    tag { sample_name }

    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_name), val(sex), path(fastqs)

    """
    echo "${sample_name},${sex}:"

    ls -g *.fastq.gz
    """
}

workflow {

    Channel.fromList( params.samples )
        | flatMap { rec ->
            rec.fastqs.collect { rg ->
                // 'def' keeps the variable local to this closure
                def readgroup = tuple( file(rg.r1), file(rg.r2) )

                tuple( rec.name, rec.sex, readgroup )
            }
        }
        | test_proc
}

Results:

$ mkdir -p ./path/to/more
$ touch ./path/to/foobar_R{1,2}.fastq.gz
$ touch ./path/to/more/foobar_R{1,2}.fastq.gz
$ touch ./path/to/bazquux_R{1,2}.fastq.gz

$ nextflow run main.nf -params-file file.yaml 
N E X T F L O W  ~  version 22.04.0
Launching `main.nf` [desperate_colden] DSL2 - revision: 391a9a3b3a
executor >  local (3)
[ed/61c5c3] process > test_proc (foobar) [100%] 3 of 3 ✔
foobar,male:
lrwxrwxrwx 1 users 35 Oct 14 13:56 foobar_R1.fastq.gz -> ../../../path/to/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 35 Oct 14 13:56 foobar_R2.fastq.gz -> ../../../path/to/foobar_R2.fastq.gz

bazquux,female:
lrwxrwxrwx 1 users 36 Oct 14 13:56 bazquux_R1.fastq.gz -> ../../../path/to/bazquux_R1.fastq.gz
lrwxrwxrwx 1 users 36 Oct 14 13:56 bazquux_R2.fastq.gz -> ../../../path/to/bazquux_R2.fastq.gz

foobar,male:
lrwxrwxrwx 1 users 40 Oct 14 13:56 foobar_R1.fastq.gz -> ../../../path/to/more/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 40 Oct 14 13:56 foobar_R2.fastq.gz -> ../../../path/to/more/foobar_R2.fastq.gz
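
Before wiring up a process, it can help to check how the YAML was parsed. The following is only a minimal sketch, assuming the same file.yaml with the top-level samples key shown above; the message format is purely illustrative. It should print one line per sample record:

```groovy
// main.nf: inspect the records loaded with -params-file (illustrative sketch)
workflow {
    Channel.fromList( params.samples )
        .view { rec ->
            // each record is a map with 'name', 'sex' and a list of 'fastqs' pairs
            "${rec.name} (${rec.sex}): ${rec.fastqs.size()} read pair(s)"
        }
}
```

This would be run the same way as the example above, i.e. with nextflow run main.nf -params-file file.yaml.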


As requested, here's a solution that runs per sample. The problem is that we cannot simply feed in a list of lists using the path qualifier, since an ArrayList is not a valid path value. We could flatten() the list of file pairs, but this makes it difficult to access each file pair if we need it. You may not need the file pair relationship, but assuming you do, I think the right solution is to feed the R1 and R2 files in separately, i.e. using one path qualifier for R1 and another for R2. The following example introspects the instance type to (re-)create the list of readgroups. Since some files in the YAML have identical names, we can use the stageAs option to localize the files into progressively indexed subdirectories.

process test_proc {

    tag { sample_name }

    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_name), val(sex), path(r1, stageAs:'*/*'), path(r2, stageAs:'*/*')

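    script:
    // r1/r2 arrive as lists when a sample has several pairs, or as single Paths when it has one
    def readgroups
    if( [r1, r2].every { it instanceof List } )
        readgroups = [r1, r2].transpose()
    else if( [r1, r2].every { it instanceof Path } )
        readgroups = [[r1, r2]]
    else
        error "Invalid readgroup configuration"

    // format each pair as "R1,R2"; closure parameters are renamed to avoid shadowing the inputs
    def read_pairs = readgroups.collect { fq1, fq2 -> "${fq1},${fq2}" }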

    """
    echo "${sample_name},${sex}:"
    echo ${read_pairs.join(' ')}

    ls -g */*.fastq.gz
    """
}

workflow {

    Channel.fromList( params.samples )
        | map { rec ->

            def r1 = rec.fastqs.r1.collect { file(it) }
            def r2 = rec.fastqs.r2.collect { file(it) }

            tuple( rec.name, rec.sex, r1, r2 )
        }
        | test_proc
}

Results:

$ nextflow run main.nf -params-file file.yaml 
N E X T F L O W  ~  version 22.04.0
Launching `main.nf` [berserk_sanger] DSL2 - revision: 2f317a8cee
executor >  local (2)
[93/6345c9] process > test_proc (bazquux) [100%] 2 of 2 ✔
foobar,male:
1/foobar_R1.fastq.gz,1/foobar_R2.fastq.gz 2/foobar_R1.fastq.gz,2/foobar_R2.fastq.gz
lrwxrwxrwx 1 users 38 Oct 19 13:43 1/foobar_R1.fastq.gz -> ../../../../path/to/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 38 Oct 19 13:43 1/foobar_R2.fastq.gz -> ../../../../path/to/foobar_R2.fastq.gz
lrwxrwxrwx 1 users 43 Oct 19 13:43 2/foobar_R1.fastq.gz -> ../../../../path/to/more/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 43 Oct 19 13:43 2/foobar_R2.fastq.gz -> ../../../../path/to/more/foobar_R2.fastq.gz

bazquux,female:
1/bazquux_R1.fastq.gz,1/bazquux_R2.fastq.gz
lrwxrwxrwx 1 users 39 Oct 19 13:43 1/bazquux_R1.fastq.gz -> ../../../../path/to/bazquux_R1.fastq.gz
lrwxrwxrwx 1 users 39 Oct 19 13:43 1/bazquux_R2.fastq.gz -> ../../../../path/to/bazquux_R2.fastq.gz
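
The transpose step used in the per-sample process can also be tried in isolation. The plain Groovy sketch below is an illustration only: it uses the staged file names from the foobar output above as strings, whereas inside the process the values would be Path objects. It shows how two parallel lists of R1 and R2 files are zipped back into read pairs:

```groovy
// Stand-ins for the staged inputs of the 'foobar' sample (strings here, Paths in the process)
def r1 = ['1/foobar_R1.fastq.gz', '2/foobar_R1.fastq.gz']
def r2 = ['1/foobar_R2.fastq.gz', '2/foobar_R2.fastq.gz']

// transpose() pairs up elements by position: [[r1[0], r2[0]], [r1[1], r2[1]]]
def readgroups = [r1, r2].transpose()

// Each two-element pair is unpacked into the two closure parameters
def read_pairs = readgroups.collect { fq1, fq2 -> "${fq1},${fq2}" }

println read_pairs.join(' ')
// prints: 1/foobar_R1.fastq.gz,1/foobar_R2.fastq.gz 2/foobar_R1.fastq.gz,2/foobar_R2.fastq.gz
```

Keeping R1 and R2 as separate, equally ordered lists is what makes this reconstruction possible; a single flattened list would lose the pairing.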