I am using Python 2.6.6 and I am trying to remove fastq
reads in file2
that overlap (i.e., are identical to) reads in file1
. Here is code I am trying to implement:
ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")
for spk in spk_reads:
if spk in ref_reads:
del ref_reads[spk]
However, I get this error related to my use of del
:
AttributeError: _IndexedSeqFileDict instance has no attribute '__delitem__'
Is it possible to delete an item using the present formulation? How can I delete an item from a dictionary generated using SeqIO.index()
?
I also tried the following:
# import read data
ref_reads = SeqIO.index("main.fastq", "fastq")
spk_reads = SeqIO.index("over.fastq", "fastq")
# note that ref_reads.keys() doesn't return a list but a 'dictionary- keyiterator',
# so we turn it into a set to work with it
ref_keys = set(ref_reads.keys())
spk_keys = set(spk_reads.keys())
# loop to remove overlap reads
for spk in spk_keys:
if spk in ref_keys:
del ref_keys[spk]
# output data
output_handle = open(fname_out, "w")
SeqIO.write(ref_reads[ref_keys], output_handle, "fastq")
output_handle.close()
SeqIO.index() doesn't return a true dictionary, but a dictionary like object, giving the SeqRecord objects as values:
This dictionary like object is an
_IndexedSeqFileDict
instance. The docstring mentions:So, you will need to convert your fastq file to an in-memory Python dictionary using
SeqIO.parse()
andSeqIO.to_dict()
:If your files are so large that working with
SeqIO.parse()
isn't feasible, then I would do something like this:Edit, answer to your comment:
Like I stated above,
SeqIO.index("file1.fastq", "fastq")
returns a read-only_IndexedSeqFileDict
object. So you can not, by design, delete items from it.The updated code below shows how you can create a new fastq file where the overlapping reads are removed.
If you really want a new
SeqIO.index()
object, then you can read this file in again withSeqIO.index()
.