Delete an item from a dictionary generated by SeqIO.index

414 views Asked by At

I am using Python 2.6.6 and I am trying to remove fastq reads in file2 that overlap (i.e., are identical to) reads in file1. Here is code I am trying to implement:

ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")

for spk in spk_reads:
    if spk in ref_reads:
    del ref_reads[spk]

However, I get this error related to my use of del:

AttributeError: _IndexedSeqFileDict instance has no attribute '__delitem__'

Is it possible to delete an item using the present formulation? How can I delete an item from a dictionary generated using SeqIO.index()?

I also tried the following:

# import read data
ref_reads = SeqIO.index("main.fastq", "fastq")
spk_reads = SeqIO.index("over.fastq", "fastq")

# note that ref_reads.keys() doesn't return a list but a 'dictionary-       keyiterator', 
# so we turn it into a set to work with it
ref_keys = set(ref_reads.keys())  
spk_keys = set(spk_reads.keys())

# loop to remove overlap reads
for spk in spk_keys:
    if spk in ref_keys:
        del ref_keys[spk]

# output data
output_handle = open(fname_out, "w")
SeqIO.write(ref_reads[ref_keys], output_handle, "fastq")
output_handle.close()
1

There are 1 answers

1
BioGeek On

SeqIO.index() doesn't return a true dictionary, but a dictionary like object, giving the SeqRecord objects as values:

Note that this pseudo dictionary will not support all the methods of a true Python dictionary, for example values() is not defined since this would require loading all of the records into memory at once.

This dictionary like object is an _IndexedSeqFileDict instance. The docstring mentions:

Note that this dictionary is essentially read only. You cannot add or change values, pop values, nor clear the dictionary.

So, you will need to convert your fastq file to an in-memory Python dictionary using SeqIO.parse() and SeqIO.to_dict():

from Bio import SeqIO

ref_reads = SeqIO.parse("file1.fastq", "fastq")
spk_reads = SeqIO.parse("file1.fastq", "fastq")

ref_reads_dict = SeqIO.to_dict(ref_reads)

for spk in spk_reads:
    if spk.id in ref_reads_dict:
        del ref_reads_dict[spk.id]

If your files are so large that working with SeqIO.parse() isn't feasible, then I would do something like this:

from Bio import SeqIO

ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")

# note that ref_reads.keys() doesn't return a list but a 'dictionary-keyiterator', 
# so we turn it into a set to work with it
ref_keys = set(ref_reads.keys())  
spk_keys = set(spk_reads.keys())

unique_ref_keys = ref_keys - spk_keys

# this step might take a long time if your files are large
unique_ref_reads = {key: ref_reads[key] for key in unique_ref_keys} 

Edit, answer to your comment:

how can I again solve the original problem of deleting items from SeqIO.index("file1.fastq", "fastq")?

Like I stated above, SeqIO.index("file1.fastq", "fastq") returns a read-only _IndexedSeqFileDict object. So you can not, by design, delete items from it.

The updated code below shows how you can create a new fastq file where the overlapping reads are removed.

If you really want a new SeqIO.index() object, then you can read this file in again with SeqIO.index().

from Bio import SeqIO

ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")

ref_keys = set(ref_reads.keys())  
spk_keys = set(spk_reads.keys())

unique_ref_keys = ref_keys - spk_keys

# conserve memory by using a generator expression
unique_ref_records = (ref_reads[key] for key in unique_ref_keys)

# output new file with overlapping reads removed
with open(fname_out, "w") as output_handle:
    SeqIO.write(unique_ref_records , output_handle, "fastq")

# optionally, create a new SeqIO.index() object 
unique_ref_reads = SeqIO.index(fname_out, "fastq")