I would like to edit a sequencing Fastq file, and delete lines that are repetitive only at certain character positions. Ideally I would iterate over every line in the input file and output a file that has only a single instance of any unique set of characters.
As shown below, I am only interested in the first 6 chars, the last 6 chars, and a portion of the intervening chars of every line, and I want to keep only one instance of each unique combination of the three sequences.
AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA Start by comparing to this line
AAAAAACCCAAACCCCCCTTTTTTTTTTCCCCCCCCAAAAAA 1-6, 19-28, 37-42 are same, so delete
AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCAAACCAAAAAA 1-6, 19-28, 37-42 are same, so delete
TTTTTTCCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA 19-28 and 37-42 are same, but 1-6 are different, so keep
As shown in the above example, if we take a file that contains only 4 lines and I am looking at chars 1-6, 19-28, and 37-42, lines 2 and 3 would be deleted (not written to the output file) because they have the same characters at each desired position, but line 4 is different, so it is kept.
I have started with the following code. My idea is to assign each position to a variable (though I don't know how to get the intervening sequence), and then compare against each line as we iterate through the input file.
with open(current_file, 'r') as f:
    next(f)  # skip the first line
    for line in f:
        start = line[:6]        # first 6 chars
        middle = line[18:28]    # chars 19-28 (Python slicing is 0-based)
        end = line[-7:-1]       # last 6 chars, excluding the trailing newline
If it helps, these files are also 5-10GB, so not tiny. I would appreciate any help. Thanks.
A simple approach is to use a dictionary keyed on the sections you want to compare. Each new instance with the same key will simply overwrite the last one, so you end up with one saved instance per unique key. For the examples you give:
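A minimal sketch of that dictionary approach, assuming the 1-based positions 1-6, 19-28, and 37-42 from the question (the name save_dict is illustrative):

```python
# The four example lines from the question.
lines = [
    "AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA",
    "AAAAAACCCAAACCCCCCTTTTTTTTTTCCCCCCCCAAAAAA",
    "AAAAAACCCCCCCCCCCCTTTTTTTTTTCCCAAACCAAAAAA",
    "TTTTTTCCCCCCCCCCCCTTTTTTTTTTCCCCCCCCAAAAAA",
]

save_dict = {}
for line in lines:
    # Concatenate the three regions of interest into one key
    # (0-based slices for 1-based positions 1-6, 19-28, 37-42).
    key = line[0:6] + line[18:28] + line[36:42]
    save_dict[key] = line  # later duplicates overwrite earlier ones

for line in save_dict.values():
    print(line)
```

Note that overwriting keeps the last instance of each key; if you want the first instance instead, use `save_dict.setdefault(key, line)` rather than a plain assignment.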
After the loop, save_dict holds one entry per unique key combination, with the last line seen for that key as its value.
(Check the indexes; I may not have included the ones you're after.)
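Since the files are 5-10 GB, you may not want every surviving line held in a dictionary at once. A streaming variant (a sketch; file names and the dedupe function are illustrative) keeps only the set of seen keys in memory and writes the first instance of each key straight to the output file:

```python
def dedupe(in_path, out_path):
    """Write only the first line for each unique key to out_path."""
    seen = set()
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            seq = line.rstrip("\n")
            # Key from 1-based positions 1-6, 19-28, 37-42.
            key = seq[0:6] + seq[18:28] + seq[36:42]
            if key not in seen:
                seen.add(key)
                fout.write(line)
```

Only the keys are stored, not the full lines, which bounds memory by the number of unique combinations rather than by file size.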