Using python to compare differences in strings from an input fastq file

Question

Using python to compare differences in strings from an input fastq file

183 views Asked by The Nightman At 12 June 2015 at 16:13

I would like to edit a sequencing Fastq file, and delete lines that are repetitive only at certain character positions. Ideally I would iterate over every line in the input file and output a file that has only a single instance of any unique set of characters.

So as shown below. I am only interested at looking at the first 6 chars, last 6 chars, and a portion of the intervening chars of every line, and keep only one instance of each unique combination of the three sequences.

AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA    Start by comparing to this line
AAAAAACCCAAACCCCCCTTTTTTTTTTCCCCCCCCAAAAAA    1-6, 19-28, 37-42 are same, so delete
AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCAAACCAAAAAA    1-6, 19-28, 37-42 are same, so delete
TTTTTTCCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA    1-6 and 36-42 are same but 37-42 are different so keep

As shown in the above example if we take a file that only contains 4 lines, and i am looking at chars 1-6, 19-28, 37-42, lines 2 and 3 would be deleted, or not output to an output file because they have the same characters at each desired postion, but because line 4 is different it will not be deleted.

I have started with this the following code, and my idea is to set each position to a variable (but I don't know have to get intervening sequence), and then compare with each line as we iterate through the input file.

with open(current_file, 'r') as f:
    next(f)
    for line in f:
        start = line[:6]
        end = line[-7:]

If it helps, these files are also 5-10GB, so not tiny. I would appreciate any help. Thanks.

Original Q&A

There are 2 answers

**iayork** · Answer 1 · 2015-06-12T16:36:55+00:00

A simple approach is to use a dictionary with a key made out of the sections you want to compare. Each new instance will simply overwrite the last one and you'll save unique instances. For the examples you give:

a = 'AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA'    #Start by comparing to this line
b = 'AAAAAACCCAAACCCCCCTTTTTTTTTTCCCCCCCCAAAAAA'    #1-6, 19-28, 37-42 are same, so delete
c = 'AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCAAACCAAAAAA'    #1-6, 19-28, 37-42 are same, so delete
d = 'TTTTTTCCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA'    #1-6 and 36-42 are same but 37-42 are different so keep
save_dict = {}
for fastq in (a,b,c,d):
    save_dict['%s%s%s' % (fastq[:6], fastq[19:28], fastq[37:42])] = fastq

ends up with save_dict containing

{'AAAAAACTTTTTTTTTCAAAAA': 'AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCAAACCAAAAAA',
 'TTTTTTCTTTTTTTTTCAAAAA': 'TTTTTTCCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA'}

(Check the indexes, I may have not included the ones you're after)

**James** · Answer 2 · 2015-06-12T16:37:12+00:00

Here is a script that does the following: Pull the elements from each line as a combined string, check it against a list of combined string it has already seen, add the line to a list if it is a new instance.

with open('path/to/file', 'r') as f:
    lineCharsList = []
    outLines = []
    for line in f:
        lineChars = line[0:6]+line[18:28]+line[36:42]
        if not (lineChars in lineCharsList):
            lineCharsList.append(lineChars)
            outLines.append(line)

TechQA.

Using python to compare differences in strings from an input fastq file

There are 2 answers

Related Questions in PYTHON

Related Questions in FASTQ

Related Questions in SEQUENCING

Popular Questions

Popular Tags

Trending Questions