Can I use bisect to print the content of a line?

267 views Asked by At

I have a file where each line is ordered alphabetically. The file is 12Gb, which means I can't simply read it line by line. The data looks like this:

brown    0    1    0    1    2
fox      3    5    0    0    1
jumped   2    0    6    1    0

The words at the beginning of each line are unique. The word and the numbers on each line are separated by tabs. I want to be able to query the file for specific keywords. For example, if I query "fox", the program should return "fox 3 5 0 0 1".

It seems that a good candidate for this would be the bisect module: https://docs.python.org/3.0/library/bisect.html

I found a post which uses bisect to find out the line number of a keyword: How do I perform binary search on a text file to search a keyword in python?

This is what the code looks like:

import bisect
import os

class Query(object):
    def __init__(self, query, index=5):
        self.query = query
        self.index = index

    def __lt__(self, comparable):
        return self.query < comparable[self.index:]

class FileSearcher(object):
    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)

    def __len__(self):
        return self.file_size

    def __getitem__(self, item):
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size)

with open('myfile') as file_to_search:
    query = 'fox\t' #token to query
    wrapped_query = Query(query)
    searchable_file = FileSearcher(file_to_search)
    linepos = bisect.bisect(searchable_file, wrapped_query)
    print "Located @ line: ", linepos
    #print content of line?

However, I can't figure out how to actually print the content of the line. I should at least add a read statement somewhere, but I don't know where.

Is it possible to print the content of the line with the bisect module?

3

There are 3 answers

2
gboffi On

The following recursive function should be able to narrow the search interval. I'm not sure that you can modify it so that it returns a match or None for no match.

def bisearch(f, word, i, j)
    if (j-1)<1E6: return i,j

    k = (i+j)/2
    f.seek(k)


    while k<j:
        c = f.read(1)
        k = k+1
        if c == '\n': break
    else:
        # ??? no match ??? I'm not sure

    w = []
    while 1:
        c = f.read(1)
        if c == '\t': break
        w.append(c)

    w = "".join(w)
    if w == word:
         return k, k
    if w < word:
         return bisearch(f, word, k, j)
    else:
         return bisearch(f, word, i, k)

and here an example of usage

word = ...
f = open(...)

i,j = bisearch(f, word, 0, len_f)

f.seek(i)
if i==j:
    line = f.readline()
else:
#################### EDIT ################
#   OLD
#   buffer = f.read(1E6)
#   NEW
    buffer = f.read(j-i)
    lenw = len(word)
    for line in buffer.split('\n'):
        if line[:lenw] == word: break
    else:
        # no matches, SOS

result = process(line)
0
Kevin On

Try seeking to the line in question and using readline.

print "Located @ line: ", linepos
file_to_search.seek(linepos)
line = file_to_search.readline()

This is assuming linepos is the position of the line, counted in bytes from the beginning of the file. If it's the position counted in line numbers, you'll need to multiply by the number of bytes per line before seeking.

print "Located @ line: ", linepos
file_to_search.seek(linepos * searchable_file.record_size)
line = file_to_search.readline()
3
Anton Zuenko On

If you want go with Python solution, you can do the following:

  • Read file by small chunks of MAX_LINE bytes, each time moving forward by fixed offset
  • That offset determines block size
  • For each such read, determine the key (first word in a line)
  • These keys serve as delimiters of blocks
  • Construct the list of such keys. The list would be sorted as keys are ordered
  • You may persist such list somewhere via pickle/json.dumps/...
  • When quering, find via bisect the index of a block where you key is located
  • Read that block entirely and find the key with data

Here is the example file bigfile:

abc 4
bar 2
baz 3
egg 6
foo 1
god 8
ham 5
sex 7

The code:

import os
from bisect import bisect

MAX_LINE = 7
BLOCK_SIZE = 10

def parse_chunks(filename):

    size = os.path.getsize(filename)
    chunks = []

    with open(filename, 'rb') as file:
        block = str(file.read(MAX_LINE*2))
        first_line = block[:block.find('\n') + 1]
        chunks.append(first_line.split()[0])

        pos = BLOCK_SIZE
        while pos < size:
            file.seek(pos)
            block = str(file.read(MAX_LINE*2))
            first_eol = block.find('\n')
            second_eol = block.find('\n', first_eol + 1)
            if first_eol == -1 or second_eol == -1:
                break

            line = block[first_eol + 1:second_eol]

            key = line.split()[0]
            chunks.append(key)

            pos += BLOCK_SIZE

    return chunks


if __name__ == '__main__':
    BLOCK_SIZE = 10
    filename = 'bigfile'
    chunks = parse_chunks(filename)

    query = 'abc'
    pos_before = bisect(chunks, query) - 1

    with open(filename, 'rb') as file:
        file.seek(pos_before*BLOCK_SIZE)
        block = str(file.read(BLOCK_SIZE + MAX_LINE))
        line_start = block.find(query)
        line_end = block.find('\n', line_start + 1)
        line = block[line_start:line_end]

        print(line)

In this toy example I use block size of 10 bytes, in your case of 12GB file I'd suggest you to start with 1M.