The Nextflow channel join pattern

inputfq = Channel
    .of('/playground/barcode01.fastq.gz',
        '/playground/barcode02.fastq.gz',
        '/playground/barcode03.fastq.gz')
    .toList()
    .view()

params.samples_csv = "./samples.csv"

CTRL_ch = Channel
    .fromPath(params.samples_csv, checkIfExists:true)
    .splitCsv(header: true, sep: ',')
    .map { row -> tuple(row.sampleID, file(row.infq), file(row.ref)) }
    .view()
    .join(inputfq, by: [1])
    .view()
    .map { csvRow, fqFile -> tuple(csvRow[0], fqFile, csvRow[2]) }
    .view()

The Nextflow script is listed above, and samples.csv is in the same directory:

sampleID,infq,ref
DNA_CTRL,/playground/barcode01.fastq.gz,/playground/ref/DNA_ref.fa
RNA_CTRL,/playground/barcode03.fastq.gz,/playground/ref/RNA_ref.fa

But only the first two .view() results showed on the screen, so I think something is wrong at the .join() step. Can someone show me what I did wrong?

My expected results are: [/playground/barcode01_reads_without_host.fastq.gz, /playground/barcode02_reads_without_host.fastq.gz, /playground/barcode03_reads_without_host.fastq.gz]

[DNA_CTRL, /playground/barcode01.fastq.gz, /playground/ref/DNA_ref.fa] [RNA_CTRL, /playground/barcode03.fastq.gz, /playground/ref/RNA_ref.fa]

[DNA_CTRL, /playground/barcode01.fastq.gz, /playground/ref/DNA_ref.fa] [RNA_CTRL, /playground/barcode03.fastq.gz, /playground/ref/RNA_ref.fa]

[DNA_CTRL, /playground/barcode01.fastq.gz, /playground/ref/DNA_ref.fa] [RNA_CTRL, /playground/barcode03.fastq.gz, /playground/ref/RNA_ref.fa]

============================

Thank you for the reply.

Sorry for the confusion. I realized that my samples.csv didn’t match the inputfq, so I fixed it.

What I am trying to do here is part of a workaround to ensure that the processes are executed in the order I want. inputfq is the output of a process that generates many files and is always executed. The workflow also has a conditional process that is executed only when samples.csv is provided. The problem is that, when samples.csv is provided, this process takes its inputs straight from samples.csv and is executed prematurely, even though the file in row.infq in CTRL_ch is one of the files generated by the process that creates inputfq. Therefore, I use .join() to control the order of the processes, so that the process that generates inputfq is executed first when samples.csv is provided. The .join() also acts as a check that the two input files listed in samples.csv exist.

I put 4 .view() calls in for debugging, and the results indicate that something went wrong at the join() step, hence this post.

There is 1 answer

dthorbur

If you look at the docs for join, they show that once one queue channel is exhausted (i.e., it has no more items to emit), the operator stops emitting items. So you only see 2 output lines because your CSV has only 2 rows.
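
As a minimal sketch of how join treats matching and non-matching keys (the channels left and right and the keys k1 to k3 are made up for illustration, they are not from your pipeline), labelling each view also makes it clear which step stops emitting:

```groovy
// Hypothetical toy channels: each item is a [key, value] tuple.
left  = Channel.of(['k1', 'a'], ['k2', 'b'])
right = Channel.of(['k2', 'x'], ['k3', 'y'])

left
    .view { item -> "left:   ${item}" }    // prints both left items
    .join(right)                           // default key is index 0
    .view { item -> "joined: ${item}" }    // only the k2 pair has a match

// remainder: true also emits the unmatched items, padded with null,
// which helps to see what failed to match.
left
    .join(right, remainder: true)
    .view { item -> "with remainder: ${item}" }
```

With the default options, the only joined item should be [k2, b, x]; the k1 and k3 items are silently dropped because they have no partner in the other channel.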

I don't understand how you are trying to join the two channels: there are no overlapping file names, so join with the by option shouldn't emit anything, because file(row.infq) and the items in inputfq are not the same.
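
Two details in the posted script also seem to work against a match, as far as I can tell: .toList() collapses inputfq into a single item (a list of all three paths), and by: [1] then uses the element at index 1 of each item as the key, which is a plain string on the inputfq side and a file() object on the CTRL_ch side. Here is a rough sketch of one way to make the keys comparable, assuming the goal is to pair each CSV row with the FASTQ of the same file name. Keying on the file name is my assumption, not something stated in the question:

```groovy
params.samples_csv = "./samples.csv"

// Emit one item per FASTQ (no toList), shaped as [fileName, file].
inputfq = Channel
    .of('/playground/barcode01.fastq.gz',
        '/playground/barcode02.fastq.gz',
        '/playground/barcode03.fastq.gz')
    .map { fq -> tuple(file(fq).name, file(fq)) }

// Put the same kind of key (the file name, a String) first in each CSV tuple.
CTRL_ch = Channel
    .fromPath(params.samples_csv, checkIfExists: true)
    .splitCsv(header: true, sep: ',')
    .map { row -> tuple(file(row.infq).name, row.sampleID, file(row.ref)) }
    .join(inputfq)   // default by: 0, i.e. the file name
    .map { name, sampleID, ref, fq -> tuple(sampleID, fq, ref) }
    .view()
```

With the samples.csv from the question, this should emit one [sampleID, fastq, ref] tuple per CSV row, and barcode02 is dropped because no row refers to it.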

Regardless, I think you are looking for combine. But your expected output doesn't make sense to me: it looks like the output of the two separate channels without a join, but with CTRL_ch repeated 3 times.
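
For comparison, a small sketch of what combine does. The sample IDs and paths are copied from the question, but the CSV rows are hard-coded here as [sampleID, ref] tuples to keep the example short, so treat the shape as illustrative only:

```groovy
// One item per FASTQ file.
inputfq = Channel
    .of('/playground/barcode01.fastq.gz',
        '/playground/barcode02.fastq.gz',
        '/playground/barcode03.fastq.gz')

// Stand-in for the parsed CSV rows: [sampleID, ref].
CTRL_ch = Channel
    .of(['DNA_CTRL', '/playground/ref/DNA_ref.fa'],
        ['RNA_CTRL', '/playground/ref/RNA_ref.fa'])

// combine emits the Cartesian product: 2 rows x 3 files = 6 tuples,
// each shaped like [sampleID, ref, fq].
CTRL_ch
    .combine(inputfq)
    .view()
```

Unlike join, combine does not look at keys unless you pass by, so every CSV row is paired with every FASTQ file.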