I need to compare the two coordinates on each line (the 2nd and 3rd words) to see where intervals from two files overlap. My code does this, but very slowly: a file with 10,000 lines takes about two minutes. I need to run it on a file with 3 billion lines, which I estimate would take forever. Is there a way to refactor my code so that it is much faster?
This is what I have so far, and it does exactly what I want:
import os.path

# Opening in "w" mode here only truncates Output.txt; all writing happens below.
with open("Output.txt", "w") as result:
    with open("bedgraph2.txt") as file1:
        for f1_line in file1:
            segment_1 = f1_line.split()
            # bedgraph1.txt is re-read from the start for every line of bedgraph2.txt
            with open("bedgraph1.txt") as file2:
                for f2_line in file2:
                    segment_2 = f2_line.split()
                    # Two intervals overlap when each one starts before the other ends
                    if (int(segment_1[2]) > int(segment_2[1])) and (int(segment_1[1]) < int(segment_2[2])):
                        with open("Output.txt", "a") as add:
                            add.write(segment_1[0])
                            add.write(" ")
                            add.write(segment_1[1])
                            add.write(" ")
                            add.write(segment_1[2])
                            add.write(" ")
                            add.write(segment_1[3])
                            add.write(" | ")
                            add.write(segment_2[0])
                            add.write(" ")
                            add.write(segment_2[1])
                            add.write(" ")
                            add.write(segment_2[2])
                            add.write(" ")
                            add.write(segment_2[3])
                            add.write("\n")
                            # Stop at the first overlapping interval
                            break
print("done")
This is a sample of the data:
bedgraph2.txt
chr01 1780 1795 -0.811494
chr01 1795 1809 -1.622988
chr01 1809 1829 -2.434482
chr01 1829 1830 -3.245976
chr01 1830 1845 -2.434482
chr01 1845 1859 -1.622988
chr01 1859 1879 -0.811494
chr01 1934 1984 -0.811494
chr01 3550 3600 -0.811494
chr01 3790 3840 -0.811494
chr01 3882 3902 -0.811494
chr01 3902 3932 -1.622988
bedgraph1.txt
chr01 1809 1859 -1.139687
chr01 1965 2015 -1.139687
chr01 3790 3840 -1.139687
chr01 3930 3942 -1.139687
chr01 3942 3980 -2.279375
chr01 3980 3992 -1.139687
chr01 4260 4310 -1.139687
chr01 4361 4382 -1.139687
chr01 4382 4411 -2.279375
chr01 4411 4432 -1.139687
chr01 4473 4523 -1.139687
chr01 4605 4655 -1.139687
Thanks in advance
I suggest using bedtools:
http://bedtools.readthedocs.org/en/latest/
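Its intersect function likely does what you want.
Also, whether or not you use bedtools, the algorithm can be improved by first sorting the two input files. With sorted input, each file only needs to be read once, instead of rescanning bedgraph1.txt for every line of bedgraph2.txt.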
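To illustrate the sorting idea, here is a minimal sketch of a single-pass merge over two sorted files. It assumes both files are already sorted by chromosome name (in plain string order) and then by start coordinate, as the sample data appears to be. The function names `records`, `first_overlaps`, and `write_overlaps` are made up for this example. Unlike the code in the question, it only reports overlaps between intervals on the same chromosome, which is an assumption about what is wanted.

```python
def records(lines):
    """Yield (chrom, start, end, value) tuples from bedGraph-style lines."""
    for line in lines:
        if line.strip():  # skip blank lines
            chrom, start, end, value = line.split()
            yield chrom, int(start), int(end), value


def first_overlaps(lines_a, lines_b):
    """Yield (a, b) where b is the first record of lines_b overlapping a.

    Assumption: both inputs are sorted by (chrom, start), with chromosome
    names ordered as plain strings. Each input is read exactly once.
    """
    recs_b = records(lines_b)
    b = next(recs_b, None)
    for a in records(lines_a):
        # Discard b records that sit on an earlier chromosome or end at or
        # before a's start; they cannot overlap a or any later a record.
        while b is not None and (b[0] < a[0] or (b[0] == a[0] and b[2] <= a[1])):
            b = next(recs_b, None)
        if b is None:
            break  # nothing left in lines_b to overlap with
        # b ends after a starts, so it overlaps if it also starts before a ends.
        if b[0] == a[0] and b[1] < a[2]:
            yield a, b


def write_overlaps(path_a, path_b, path_out):
    """Write 'a | b' lines in the same layout as the question's output."""
    with open(path_a) as fa, open(path_b) as fb, open(path_out, "w") as out:
        for a, b in first_overlaps(fa, fb):
            out.write("%s %d %d %s | %s %d %d %s\n" % (a + b))
```

Calling `write_overlaps("bedgraph2.txt", "bedgraph1.txt", "Output.txt")` would mirror the file names in the question. Because each file is streamed once and only one record per file is held in memory, the work grows with the total number of lines rather than with the product of the two line counts.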