Bio Perl: code to split paired end data?

1.1k views Asked by At

I am a beginner with bioinformatics and I have been working on a little Bio Perl code to split my paired end MiSeq data (currently in 1 fastq file) into 2 files, each file containing one end of the pair. The different ends of the paired end reads can be distinguished by a 1 or a 2 after the space in the fastq header. The file follows a typical fastq format, example from using "head" in the command line:

@M00763:6:000000000-A1U80:1:1101:12620:1732 1:N:0:1
TTATACTC
+
@A@AA@A@
@M00763:6:000000000-A1U80:1:1101:12620:1732 2:N:0:1
T
+
E

I have written a code trying to target the 1 or 2 in the header using a match. Although I am using Bio::SeqIO perl does not seem to be recognizing the fastq format, and I keep getting this error:

MSG: Could not guess format from file/fh
STACK: Error::throw
STACK: Bio::Root::Root::throw /sw/lib/perl5/5.12.3/Bio/Root/Root.pm:472
STACK: Bio::SeqIO::new /sw/lib/perl5/5.12.3/Bio/SeqIO.pm:389
STACK: SplitPairedEndReads.pl:7

Can someone help me find/fix my error? The information available from BioPerl website indicates that Bio::SeqIO should be able to recognize fastq format.

Here is the code I have written:

#!/usr/bin/perl 

use Bio::SeqIO;
use Bio::SeqIO::fastq;


$seqout1 = Bio::SeqIO->new(-file => ">peread1.fastq" -format => "fastq",);
$seqout2 = Bio::SeqIO->new(-file => ">peread2.fastq" -format => "fastq",);

$seqio_obj = Bio::SeqIO->new(-file => "AIS351_Strin1edit.fastq", -format => "fastq",
                         -alphabet => "dna" );
$seq_obj = $seqio_obj->next_seq;

while ($seq_obj = $seqio_obj->next_seq) { 
    $name = $seq_obj->desc; if($name=~ / 1:/) {$seqout1->write_seq($seq_obj);
     } else { $seqout2->write_seq($seq_obj); 

    }
}

Thanks for your help and your patience with my beginner knowledge.

~Al

3

There are 3 answers

2
SES On

I would advise you to not use BioPerl for Fastq data because it is incredibly slow (see my comments below). You can use Pairfq for this task because this is one of the things it was designed for (full disclosure: I'm the author). Here's how it would work:

pairfq splitpairs -i AIS351_Strin1edit.fastq -f AIS351_Strin1edit_1.fastq -r AIS351_Strin1edit_2.fastq

In my benchmarks, this is about 300X faster than doing the equivalent task with BioPerl. For example, I measured that it takes 465 seconds to read 1 million Fastq records with Bio::SeqIO, while the above code can do it in about 1.5 seconds. If you have 500 million records, that is a difference in 64 hours vs. 11 minutes. That is why it is strongly discouraged to use BioPerl for NGS data. I'm not bashing BioPerl because I use it everyday, but be aware of this issue.

About the error in your comments, the BioPerl parser does not like what is on your '+' line. There must be nothing after the '+' or it must match the sequence header. It's hard to say specifically without seeing the real data, it could also be a line ending problem or something else.

EDIT: You need to put use strict; and use warnings; at the top of every script. Also, it is a good idea to test if a file exists before trying to do anything with it (like trying to read it with BioPerl). About your last question, I suggest you read up on the FASTQ format. You can't just remove lines from the record or else it won't be valid FASTQ. A minor point is you don't need to use Bio::SeqIO::fastq; because Bio::SeqIO will handle loading the appropriate class.

What you posted does not look like real data so it's not easy to say what is causing the problem.

1
fugu On

You could achieve what you're after with this snippet:

#!/usr/bin/perl
use warnings;
use strict; 

my @array = ('@M00763:6:000000000-A1U80:1:1101:12620:1732 1:N:0:1
TTATACTC
+
@A@AA@A@',
'@M00763:6:000000000-A1U80:1:1101:12620:1732 2:N:0:1
T
+
E');

foreach (@array){
        if (/\s+1:/) {
            print "1st pair: $_\n"; # You could redirect this to first.OUTFILE
         }
        if (/\s+2:/) {
            print "2nd pair: $_\n"; # You could redirect this to second.OUTFILE
         }

}

Which prints:

1st pair: @M00763:6:000000000-A1U80:1:1101:12620:1732 1:N:0:1
TTATACTC
+
@A@AA@A@
2nd pair: @M00763:6:000000000-A1U80:1:1101:12620:1732 2:N:0:1
T
+
0
toolic On

You need to add commas between all list items in your calls to new. Change:

$seqout1 = Bio::SeqIO->new(-file => ">peread1.fastq" -format => "fastq",);
$seqout2 = Bio::SeqIO->new(-file => ">peread2.fastq" -format => "fastq",);

to:

$seqout1 = Bio::SeqIO->new(-file => ">peread1.fastq", -format => "fastq",);
$seqout2 = Bio::SeqIO->new(-file => ">peread2.fastq", -format => "fastq",);