I am a beginner with bioinformatics and I have been working on a little Bio Perl code to split my paired end MiSeq data (currently in 1 fastq file) into 2 files, each file containing one end of the pair. The different ends of the paired end reads can be distinguished by a 1 or a 2 after the space in the fastq header. The file follows a typical fastq format, example from using "head" in the command line:
@M00763:6:000000000-A1U80:1:1101:12620:1732 1:N:0:1
TTATACTC
+
@A@AA@A@
@M00763:6:000000000-A1U80:1:1101:12620:1732 2:N:0:1
T
+
E
I have written a code trying to target the 1 or 2 in the header using a match. Although I am using Bio::SeqIO perl does not seem to be recognizing the fastq format, and I keep getting this error:
MSG: Could not guess format from file/fh
STACK: Error::throw
STACK: Bio::Root::Root::throw /sw/lib/perl5/5.12.3/Bio/Root/Root.pm:472
STACK: Bio::SeqIO::new /sw/lib/perl5/5.12.3/Bio/SeqIO.pm:389
STACK: SplitPairedEndReads.pl:7
Can someone help me find/fix my error? The information available from BioPerl website indicates that Bio::SeqIO should be able to recognize fastq format.
Here is the code I have written:
#!/usr/bin/perl
use Bio::SeqIO;
use Bio::SeqIO::fastq;
$seqout1 = Bio::SeqIO->new(-file => ">peread1.fastq" -format => "fastq",);
$seqout2 = Bio::SeqIO->new(-file => ">peread2.fastq" -format => "fastq",);
$seqio_obj = Bio::SeqIO->new(-file => "AIS351_Strin1edit.fastq", -format => "fastq",
-alphabet => "dna" );
$seq_obj = $seqio_obj->next_seq;
while ($seq_obj = $seqio_obj->next_seq) {
$name = $seq_obj->desc; if($name=~ / 1:/) {$seqout1->write_seq($seq_obj);
} else { $seqout2->write_seq($seq_obj);
}
}
Thanks for your help and your patience with my beginner knowledge.
~Al
I would advise you to not use BioPerl for Fastq data because it is incredibly slow (see my comments below). You can use Pairfq for this task because this is one of the things it was designed for (full disclosure: I'm the author). Here's how it would work:
In my benchmarks, this is about 300X faster than doing the equivalent task with BioPerl. For example, I measured that it takes 465 seconds to read 1 million Fastq records with Bio::SeqIO, while the above code can do it in about 1.5 seconds. If you have 500 million records, that is a difference in 64 hours vs. 11 minutes. That is why it is strongly discouraged to use BioPerl for NGS data. I'm not bashing BioPerl because I use it everyday, but be aware of this issue.
About the error in your comments, the BioPerl parser does not like what is on your '+' line. There must be nothing after the '+' or it must match the sequence header. It's hard to say specifically without seeing the real data, it could also be a line ending problem or something else.
EDIT: You need to put
use strict;
anduse warnings;
at the top of every script. Also, it is a good idea to test if a file exists before trying to do anything with it (like trying to read it with BioPerl). About your last question, I suggest you read up on the FASTQ format. You can't just remove lines from the record or else it won't be valid FASTQ. A minor point is you don't need touse Bio::SeqIO::fastq;
becauseBio::SeqIO
will handle loading the appropriate class.What you posted does not look like real data so it's not easy to say what is causing the problem.