Match "byte spans" to a text document, Python

Question

178 views Asked by Renklauf At 28 October 2011 at 20:21

I'm working with an annotated corpus that contains two sets of .txt files. The first set contains the documents that were annotated (i.e., articles, blog posts, etc.) and the second set contains the actual annotations. The way to match an annotation to the text it annotates is via "byte spans." From the readme file:

"The span is the starting and ending byte of the annotation in 
the document.  For example, the annotation listed above is from 
the document, temp_fbis/20.20.10-3414.  The span of this annotation 
is 730,740.  This means that the start of this annotation is 
byte 730 in the file docs/temp_fbis/20.20.10-3414, and byte 740 
is the character after the last character of the annotation."

So, question: how do I index the start and end byte in the document so that I can match the annotation to the text in the original document? Any ideas? I'm working in Python on this...

There are 2 answers

**bpgergo** · Answer 1 · 2011-10-28T20:27:37+00:00

"This means that the start of this annotation is 
byte 730 in the file docs/temp_fbis/20.20.10-3414, and byte 740 
is the character after the last character of the annotation.

     blah, blah, blah, example annotation, blah, blah, blah
                       |                 |
                  start byte          end byte

The data_type of all annotations should be 'string'."
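
The readme describes a half-open span: the start byte is included and the end byte is not, which is the same convention as a Python slice on a `bytes` object. A minimal sketch of that idea follows; the sample text is made up for illustration (it is not from the corpus), and UTF-8 is an assumed encoding. It also shows why the offsets have to be applied to bytes rather than to a decoded string: one non-ASCII character is enough to make the two diverge.

```python
# Hypothetical sample text (not from the corpus); "é" takes two bytes in UTF-8.
text = "café, example annotation, blah"
data = text.encode("utf-8")  # assumption: the documents are UTF-8

target = b"example annotation"
start = data.index(target)   # offset of the first byte of the annotation
end = start + len(target)    # offset one past the last byte, as in the readme

span = data[start:end]       # half-open slice: start included, end excluded
print(span.decode("utf-8"))

# The byte offset and the character offset differ because of the "é".
char_start = text.index("example annotation")
print(start, char_start)
```

So if the files may contain non-ASCII text, slice the raw bytes first and decode afterwards.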

**bpgergo** · Answer 2 · 2011-10-28T20:34:01+00:00

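# Open in binary mode, seek to the start byte, read up to the end byte.
start, end = 730, 740
f = open("myfile", "rb")
try:
    f.seek(start)
    # The end byte is exclusive, so stop once start reaches end.
    while start < end:
        byte = f.read(1)
        # Do stuff with byte.
        start += 1
finally:
    f.close()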
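
If you want the whole annotation at once rather than byte by byte, the same seek-and-read idea can be wrapped in a small helper. This is only a sketch: `read_span` is a hypothetical name, the UTF-8 default is an assumption about the corpus encoding, and the temporary file below stands in for a real document.

```python
import os
import tempfile


def read_span(path, start, end, encoding="utf-8"):
    """Return the text stored in bytes [start, end) of the file at `path`.

    `read_span` is a hypothetical helper; the encoding default is an assumption.
    """
    with open(path, "rb") as f:  # binary mode so offsets count bytes
        f.seek(start)
        raw = f.read(end - start)
    return raw.decode(encoding)


# Demo on a throwaway file that stands in for a corpus document.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"blah, blah, blah, example annotation, blah")
    demo_path = tmp.name

demo_text = read_span(demo_path, 18, 36)
print(demo_text)
os.remove(demo_path)
```

For the readme's example you would call it with the document path and the span 730, 740. Opening in binary mode matters here: in text mode, newline translation and decoding can shift the offsets.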
