Match "byte spans" to a text document, Python


I'm working with an annotated corpus that contains two sets of .txt files. The first set contains the documents that were annotated (i.e., articles, blog posts, etc.) and the second set contains the actual annotations. The way to match an annotation to the annotated text is via "byte spans." From the readme file:

"The span is the starting and ending byte of the annotation in 
the document.  For example, the annotation listed above is from 
the document, temp_fbis/20.20.10-3414.  The span of this annotation 
is 730,740.  This means that the start of this annotation is 
byte 730 in the file docs/temp_fbis/20.20.10-3414, and byte 740 
is the character after the last character of the annotation."

So, question: How do I index the start and end byte in the document so that I can match the annotation to the text in the original document? Any ideas? I'm working in Python on this...


There are 2 answers

0
bpgergo On
"This means that the start of this annotation is 
byte 730 in the file docs/temp_fbis/20.20.10-3414, and byte 740 
is the character after the last character of the annotation.

     blah, blah, blah, example annotation, blah, blah, blah
                       |                 |
                  start byte          end byte

The data_type of all annotations should be 'string'."
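The readme describes a half-open span: the start byte is included and the end byte is not, which is the same convention Python slices use. A minimal sketch of that idea, assuming the document has already been read into a bytes object (the sample text is made up from the diagram above, not taken from the corpus):

```python
# Hypothetical document contents, borrowed from the readme's diagram.
doc = b"blah, blah, blah, example annotation, blah, blah, blah"

# Locate the span by hand for this illustration; in the corpus the
# annotation file supplies these two numbers (e.g. 730,740).
target = b"example annotation"
start = doc.index(target)    # offset of the first byte of the annotation
end = start + len(target)    # offset of the byte AFTER the last one

# A slice is half-open, so it matches the readme's definition directly.
annotation = doc[start:end]
print(annotation.decode("ascii"))
```

Because the end offset is exclusive, the length of the annotation is simply end - start, with no off-by-one adjustment.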
0
bpgergo On
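# open, seek, read
start, end = 730, 740
f = open("myfile", "rb")  # binary mode, so offsets are counted in bytes
try:
    f.seek(start)
    # The end byte is exclusive: stop once start reaches it.
    while start < end:
        byte = f.read(1)
        # Do stuff with byte.
        start += 1
finally:
    f.close()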
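The seek-and-read idea above can also be wrapped in a small helper that reads the whole span in one call. This is only a sketch: read_span is a hypothetical name, and the UTF-8 default is an assumption about the corpus, so check the readme for the actual encoding. The demo writes a throwaway file rather than using a real corpus document.

```python
import os
import tempfile


def read_span(path, start, end, encoding="utf-8"):
    """Return the text between byte offsets start (inclusive) and end (exclusive)."""
    # Binary mode matters: in text mode, offsets would not reliably count bytes.
    with open(path, "rb") as f:
        f.seek(start)
        raw = f.read(end - start)
    # The encoding is an assumption; adjust it to match the corpus.
    return raw.decode(encoding)


# Demo on a temporary file standing in for a corpus document.
# The "é" occupies two bytes in UTF-8, so byte and character offsets differ.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("café, example annotation, blah".encode("utf-8"))
    demo_path = tmp.name

demo_text = read_span(demo_path, 7, 25)
os.remove(demo_path)
print(demo_text)
```

Slicing a decoded string with the same numbers would be off by one here, which is why the helper slices bytes first and decodes afterwards.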