Add missing lines in file with Python


I am a beginner at programming and Python, so apologies if this is a simple question.

I have large files that contain lines like this, for example:

10000     7
20000     1
30000     2
60000     3

What I want is a file that also contains the 'missing' lines, like this:

10000     7
20000     1
30000     2
40000     0
50000     0
60000     3

The files are rather large as I am working with whole genome sequence data. The first column is a position in the genome and the second column is the number of SNPs I find within that 10kb window. I don't think this information is really relevant, though; I just want to write simple Python code that adds these lines to the file using if/else statements.

So if the position does not match the position of the previous line + 10000, the 'missing line' is written, otherwise the normal occurring line is written.

I just foresee one problem in this, namely when several lines in a row are missing (as in my example). Does anyone have a smart solution for this simple problem?

Many thanks!


There are 4 answers

Maurice On BEST ANSWER

How about this:

# Replace lines.txt with your actual file
with open("lines.txt", "r") as file:
    last_line = 0
    lines = []
    for line in file:
        # split() with no argument handles tabs and runs of spaces alike
        num1, num2 = [int(i) for i in line.split()]
        while num1 != last_line + 10000:
            # A line is missing
            lines.append((last_line + 10000, 0)) 
            last_line += 10000
        lines.append((num1, num2))
        last_line = num1
    for num1, num2 in lines:
        # You should print to a different file here
        print(num1, num2)

Instead of the last print statement you would write the values to a new file.

Edit: I ran this code on this sample. Output below.

lines.txt

10000   7
20000   1
30000   2
60000   3

Output

10000 7
20000 1
30000 2
40000 0
50000 0
60000 3
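
As noted above, the final print call can be swapped for a write to a new file. The following is only a sketch of one way to do that: it writes each row as soon as it is known, so the whole file never has to sit in a list. The function name fill_gaps, its parameters, and the file names in the usage note are made up for illustration, and it assumes the first window starts at 10000 and that positions are sorted in ascending order.

```python
def fill_gaps(in_path, out_path, step=10000):
    """Copy in_path to out_path, adding 'position<TAB>0' rows for skipped windows.

    Hypothetical helper: assumes positions are sorted and that the first
    window starts at `step`, as in the sample data.
    """
    expected = step  # the next position we expect to see
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            pos, count = int(fields[0]), int(fields[1])
            # Emit a zero row for every window between the last row and this one
            while expected < pos:
                dst.write(f"{expected}\t0\n")
                expected += step
            dst.write(f"{pos}\t{count}\n")
            expected = pos + step
```

It could be called as fill_gaps("lines.txt", "lines_filled.txt"). Because it compares with < rather than !=, a duplicated or out-of-order position is simply copied through instead of causing an endless loop.
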
Gareth McCaughan On

I would suggest a program along the following lines. You keep track of the genome position you saw last (it would be 0 at the start, I guess). Then you read lines from the input file, one by one. For each one, you output first any missing lines (from the previous genome position + 10kb, in 10kb steps, to 10kb before the new line you've read) and then the line you have just read.

In other words, the tiny thing you're missing is that when "the position does not match the position of the previous line + 10000", you should have a little loop to generate the missing output, rather than just writing out one line. (The following remark may make no sense until you actually start writing the code: you don't actually need to test whether the position matches; if you write it right, you will find that when it matches, your loop outputs no extra lines.)

For various good reasons, the usual practice here is not to write the code for you :-), but I hope the above will help.
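
For anyone who wants something to compare their own attempt against, the loop described above might be sketched roughly as follows. This is one possible reading of the description, not a definitive implementation: the generator name with_missing_windows and the sample list are made up for illustration, and it assumes the rows are already parsed into (position, count) pairs sorted by position, with the "previous position" starting at 0.

```python
def with_missing_windows(rows, step=10000):
    """Yield (position, count) pairs, filling skipped windows with a count of 0.

    `rows` is any iterable of (position, count) pairs sorted by position.
    """
    previous = 0  # assumption: nothing has been seen yet, so start from 0
    for position, count in rows:
        # range() is empty when position == previous + step, so no explicit
        # "does the position match?" test is needed.
        for missing in range(previous + step, position, step):
            yield missing, 0
        yield position, count
        previous = position


sample = [(10000, 7), (20000, 1), (30000, 2), (60000, 3)]
filled = list(with_missing_windows(sample))
```

With the sample above, filled contains the four original pairs plus (40000, 0) and (50000, 0) in the right places.
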

Patrick Haugh On
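
from collections import defaultdict

d = defaultdict(int)
with open('file1.txt') as infile:
    for l in infile:
        # Convert both fields to int so pos can be used in range() below
        pos, count = (int(x) for x in l.split())
        d[pos] = count

# Open the output file in write mode
with open('file2.txt', 'w') as outfile:
    # pos still holds the last position read from file1.txt
    for i in range(10000, pos+1, 10000):
        outfile.write('{}\t{}\n'.format(i, d[i]))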

Here's a quick version. We read the file into a defaultdict. When we access the values later, any key that doesn't have an associated value will get the default value of zero. Then we take every number in the range 10000 to pos where pos is the last position in the first file, taken in steps of 10000. We access these values in the defaultdict and write them to the second file.
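
To make the default-value behaviour concrete, here is a small illustration using made-up numbers. It also shows one side effect worth knowing about: looking up a missing key in a defaultdict inserts that key, whereas dict.get on a plain dict does not. The variable names are only for this example.

```python
from collections import defaultdict

counts = defaultdict(int)
counts[10000] = 7

missing = counts[40000]        # int() supplies the default, so this is 0
now_present = 40000 in counts  # True: the lookup above inserted the key

plain = {10000: 7}
fallback = plain.get(40000, 0)     # also 0
still_absent = 40000 not in plain  # True: get() does not insert anything
```

For this task either approach gives the same output; the difference only matters if the dictionary is inspected again afterwards.
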

Darth Kotik On

I would use a defaultdict, which will use 0 as the default value. You read your file into this defaultdict, then walk through the expected keys yourself and write the result back to the file.

It will look somewhat like this:

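from collections import defaultdict

filename = 'lines.txt'  # placeholder: replace with your actual file

x = defaultdict(int)
with open(filename) as f:
    for line in f:
        data = line.split()
        x[int(data[0])] = int(data[-1])

# Note: this overwrites the input file in place
with open(filename, 'w') as f:
    # Start at 10000 so no extra row for position 0 is written
    for i in range(10000, max(x.keys())+1, 10000):
        f.write('{}\t{}\n'.format(i, x[i]))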