Getting records that differ between two FASTQ files


I have two FASTQ files, F1.fastq and F2.fastq. F2.fastq is a smaller file containing a subset of the reads in F1.fastq. I want the reads in F1.fastq that ARE NOT in F2.fastq. The following Python code does not seem to work. Can you suggest edits?

needed_reads = []
reads_array = []
chosen_array = []

for x in Bio.SeqIO.parse("F1.fastq","fastq"):
    reads_array.append(x)

for y in Bio.SeqIO.parse("F2.fastq","fastq"):
    chosen_array.append(y)

for y in chosen_array:
    for x in reads_array:
        if str(x.seq) != str(y.seq) : needed_reads.append(x)

output_handle = open("DIFF.fastq","w")
SeqIO.write(needed_reads,output_handle,"fastq")
output_handle.close()

There is 1 answer

Anand S Kumar (best answer)

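Your nested loop appends a read from F1.fastq once for every read in F2.fastq whose sequence differs from it, so almost every read ends up in the output, many times over. You can use a set instead: collect the IDs of the reads in F2.fastq in a set, then keep only the reads from F1.fastq whose ID is not in that set. Build the set from the read IDs rather than from the records themselves, because Biopython does not define value equality for SeqRecord objects, so set(list1) - set(list2) on the parsed records will not give the reads you expect. If you would rather match on sequence, as in your code, use str(x.seq) in place of x.id.

Sample code:

from Bio import SeqIO

# Collect the IDs of the reads in the subset file.
chosen_ids = set()
for y in SeqIO.parse("F2.fastq", "fastq"):
    chosen_ids.add(y.id)

# Keep only the reads from F1.fastq whose ID is not in the subset.
needed_reads = []
for x in SeqIO.parse("F1.fastq", "fastq"):
    if x.id not in chosen_ids:
        needed_reads.append(x)

# The with block closes the output file even if writing fails.
with open("DIFF.fastq", "w") as output_handle:
    SeqIO.write(needed_reads, output_handle, "fastq")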
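
If F1.fastq is too large to hold in memory, the same set-based filter can be applied while streaming the file. The following is only a minimal sketch in plain Python, not Biopython's parser: it assumes every record occupies exactly four lines (no wrapped sequences) and that the read ID is the header text up to the first whitespace. The names read_fastq, read_id and write_difference are hypothetical helpers introduced for illustration.

```python
from itertools import islice


def read_fastq(path):
    """Yield (header, sequence, plus_line, quality) for each 4-line record."""
    with open(path) as handle:
        while True:
            record = [line.rstrip("\n") for line in islice(handle, 4)]
            if len(record) < 4:
                break  # end of file, or a truncated final record
            yield tuple(record)


def read_id(header):
    """Return the header without the leading '@', up to the first whitespace."""
    parts = header[1:].split()
    return parts[0] if parts else ""


def write_difference(all_path, subset_path, out_path):
    """Write records of all_path whose ID is absent from subset_path.

    Only the subset's IDs are held in memory; returns the number of
    records written.
    """
    subset_ids = {read_id(rec[0]) for rec in read_fastq(subset_path)}
    kept = 0
    with open(out_path, "w") as out:
        for rec in read_fastq(all_path):
            if read_id(rec[0]) not in subset_ids:
                out.write("\n".join(rec) + "\n")
                kept += 1
    return kept
```

With the files from the question, this would be called as write_difference("F1.fastq", "F2.fastq", "DIFF.fastq"). Only the IDs from the smaller file are stored, so memory use depends on F2.fastq rather than F1.fastq.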