How to compare two files quickly?

Question

191 views Asked by user3904534 At 03 January 2025 at 16:54

I need to compare two coordinates (the 2nd and 3rd fields on each line) between two files to see where the intervals overlap. My code does this, but it is very slow: for a file with 10,000 lines it takes about two minutes. I need to run it on a file with 3 billion lines, which I estimate would take forever. Is there a way to refactor my code so that it is much faster?

So far I can do exactly what I want, with this code:

# For every line of bedgraph2.txt, scan bedgraph1.txt for the first
# interval that overlaps it (nested loops, so this is slow).
with open("Output.txt", "w") as result:
  with open("bedgraph2.txt") as file1:
    for f1_line in file1:
      segment_1 = f1_line.split()
      # bedgraph1.txt is reopened and rescanned for every line of bedgraph2.txt
      with open("bedgraph1.txt") as file2:
        for f2_line in file2:
          segment_2 = f2_line.split()
          # Two intervals overlap when each one starts before the other ends
          if int(segment_1[2]) > int(segment_2[1]) and int(segment_1[1]) < int(segment_2[2]):
            # Write through the handle that is already open
            result.write(" ".join(segment_1[:4]) + " | " + " ".join(segment_2[:4]) + "\n")
            break  # stop at the first overlap

print("done")

This is a sample of the data

bedgraph2.txt
chr01   1780    1795    -0.811494
chr01   1795    1809    -1.622988
chr01   1809    1829    -2.434482
chr01   1829    1830    -3.245976
chr01   1830    1845    -2.434482
chr01   1845    1859    -1.622988
chr01   1859    1879    -0.811494
chr01   1934    1984    -0.811494
chr01   3550    3600    -0.811494
chr01   3790    3840    -0.811494
chr01   3882    3902    -0.811494
chr01   3902    3932    -1.622988


bedgraph1.txt
chr01   1809    1859    -1.139687
chr01   1965    2015    -1.139687
chr01   3790    3840    -1.139687
chr01   3930    3942    -1.139687
chr01   3942    3980    -2.279375
chr01   3980    3992    -1.139687
chr01   4260    4310    -1.139687
chr01   4361    4382    -1.139687
chr01   4382    4411    -2.279375
chr01   4411    4432    -1.139687
chr01   4473    4523    -1.139687
chr01   4605    4655    -1.139687

Thanks in advance

There is 1 answer

**Vince** · Accepted Answer · 2015-06-25T11:57:52+00:00

I suggest using bedtools:

http://bedtools.readthedocs.org/en/latest/

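Its intersect function likely does what you want.

Also, whether or not you use bedtools, the algorithm can be improved by first sorting the two input files, so that overlaps can be found in a single pass over both files instead of rescanning one file for every line of the other.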
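
To illustrate the sorting idea, here is a minimal sketch in Python of a single-pass sweep over two sorted files. It is not bedtools' implementation, and it rests on assumptions that hold for the sample data but may not hold for yours: one chromosome per run, both files sorted by start coordinate, and no overlapping intervals within either file. The names `parse_line`, `overlapping_pairs`, and `intersect_files` are made up for this example.

```python
from collections import deque


def parse_line(line):
    """Split one whitespace-separated line into (chrom, start, end, value)."""
    chrom, start, end, value = line.split()[:4]
    return chrom, int(start), int(end), value


def overlapping_pairs(rows_a, rows_b):
    """Yield (a, b) for every overlapping pair from two sorted row streams.

    Assumes a single chromosome, rows sorted by start, and no overlaps
    within either input, so end coordinates also never decrease.
    """
    b_iter = iter(rows_b)
    window = deque()  # b rows that may still overlap the current or a later a row
    for a in rows_a:
        a_start, a_end = a[1], a[2]
        # Discard b rows that end at or before this a row starts.
        while window and window[0][2] <= a_start:
            window.popleft()
        # Read b rows until the newest buffered one starts at or after a_end.
        while not window or window[-1][1] < a_end:
            b = next(b_iter, None)
            if b is None:
                break
            if b[2] > a_start:
                window.append(b)
        # Each interval must start before the other one ends.
        for b in window:
            if b[1] < a_end and b[2] > a_start:
                yield a, b


def intersect_files(path_a, path_b, path_out):
    """Write overlapping rows in the same 'a | b' layout as the question's code."""
    with open(path_a) as fa, open(path_b) as fb, open(path_out, "w") as out:
        rows_a = (parse_line(line) for line in fa if line.strip())
        rows_b = (parse_line(line) for line in fb if line.strip())
        for a, b in overlapping_pairs(rows_a, rows_b):
            out.write(" ".join(map(str, a)) + " | " + " ".join(map(str, b)) + "\n")
```

Calling `intersect_files("bedgraph2.txt", "bedgraph1.txt", "Output.txt")` would read each file once and keep only a small buffer in memory. Unlike the code in the question, this sketch reports every overlapping pair rather than stopping at the first one, and it ignores the chromosome column, so data with several chromosomes would need to be split per chromosome first.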
