Using python to compare differences in strings from an input fastq file

216 views Asked by At

I would like to edit a sequencing Fastq file, and delete lines that are repetitive only at certain character positions. Ideally I would iterate over every line in the input file and output a file that has only a single instance of any unique set of characters.

So as shown below. I am only interested at looking at the first 6 chars, last 6 chars, and a portion of the intervening chars of every line, and keep only one instance of each unique combination of the three sequences.

AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA    Start by comparing to this line
AAAAAACCCAAACCCCCCTTTTTTTTTTCCCCCCCCAAAAAA    1-6, 19-28, 37-42 are same, so delete
AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCAAACCAAAAAA    1-6, 19-28, 37-42 are same, so delete
TTTTTTCCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA    1-6 and 36-42 are same but 37-42 are different so keep

As shown in the above example if we take a file that only contains 4 lines, and i am looking at chars 1-6, 19-28, 37-42, lines 2 and 3 would be deleted, or not output to an output file because they have the same characters at each desired postion, but because line 4 is different it will not be deleted.

I have started with this the following code, and my idea is to set each position to a variable (but I don't know have to get intervening sequence), and then compare with each line as we iterate through the input file.

with open(current_file, 'r') as f:
    next(f)
    for line in f:
        start = line[:6]
        end = line[-7:]

If it helps, these files are also 5-10GB, so not tiny. I would appreciate any help. Thanks.

2

There are 2 answers

2
iayork On

A simple approach is to use a dictionary with a key made out of the sections you want to compare. Each new instance will simply overwrite the last one and you'll save unique instances. For the examples you give:

a = 'AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA'    #Start by comparing to this line
b = 'AAAAAACCCAAACCCCCCTTTTTTTTTTCCCCCCCCAAAAAA'    #1-6, 19-28, 37-42 are same, so delete
c = 'AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCAAACCAAAAAA'    #1-6, 19-28, 37-42 are same, so delete
d = 'TTTTTTCCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA'    #1-6 and 36-42 are same but 37-42 are different so keep
save_dict = {}
for fastq in (a,b,c,d):
    save_dict['%s%s%s' % (fastq[:6], fastq[19:28], fastq[37:42])] = fastq

ends up with save_dict containing

{'AAAAAACTTTTTTTTTCAAAAA': 'AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCAAACCAAAAAA',
 'TTTTTTCTTTTTTTTTCAAAAA': 'TTTTTTCCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA'}

(Check the indexes, I may have not included the ones you're after)

1
James On

Here is a script that does the following: Pull the elements from each line as a combined string, check it against a list of combined string it has already seen, add the line to a list if it is a new instance.

with open('path/to/file', 'r') as f:
    lineCharsList = []
    outLines = []
    for line in f:
        lineChars = line[0:6]+line[18:28]+line[36:42]
        if not (lineChars in lineCharsList):
            lineCharsList.append(lineChars)
            outLines.append(line)