I have a text file of multiple gigabytes, and the millions of lines are sorted:
aaaaa
bcijkjf
dfsdf
gdfgdfhqiuzhf
zzdiszfhj
How is it possible, without loading the whole file in memory, to search if a line is existent, with a bisection search? (Possibly in O(log n) in the number of lines)
Is there a function like bisect.bisect_left among the lines of a file f = open('file.txt', 'r') in the Python library?
The window would initially be [a, b] = [0, file_size]. Then it would seek in the file at position m=(a+b)/2, look for the next \n, and read the following line l. If the pattern to search is smaller or greater than l (with lexicographic order), then we continue on [m, b] or [a, m]. Before rolling my own, does this exist in Python?
You can use the
mmapbuilt-in module. It provides random access to files (i.e., a file behaves like a large array of bytes stored in the file system). You can find more info here.The function returns
Trueiflineexists within the file,Falseotherwise. Let's use the followingmyfile.txtfile to run a few examples:This function should be much faster than a linear search on a big file.
Note: this code works with
\nendings, and not currently with\r\nWindows-style endings (not necessary for OP).