I'm working with an annotated corpus that contains two sets of .txt files. The first set contains the documents that were annotated (e.g., articles, blog posts, etc.) and the second set contains the actual annotations. Each annotation is matched to the text it annotates via "byte spans." From the readme file:
"The span is the starting and ending byte of the annotation in
the document. For example, the annotation listed above is from
the document, temp_fbis/20.20.10-3414. The span of this annotation
is 730,740. This means that the start of this annotation is
byte 730 in the file docs/temp_fbis/20.20.10-3414, and byte 740
is the character after the last character of the annotation."
So, my question: how do I index the start and end bytes in the document so that I can match each annotation to the text in the original document? Any ideas? I'm working in Python.
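For reference, here is a rough sketch of how I read the readme's description: the span looks like a half-open range, with the start byte included and the end byte excluded, which is the same convention as a Python slice. This assumes the offsets count raw bytes in the file on disk and that the documents are UTF-8 encoded; neither is confirmed by the readme. The function names `extract_span` and `read_span` are made up for illustration.

```python
def extract_span(data: bytes, start: int, end: int, encoding: str = "utf-8") -> str:
    """Return the text covered by the half-open byte span [start, end)."""
    # Slicing bytes (not str) keeps the offsets in bytes rather than characters.
    # The encoding is an assumption; errors="replace" avoids a crash if a span
    # happens to cut through a multi-byte character.
    return data[start:end].decode(encoding, errors="replace")


def read_span(path: str, start: int, end: int, encoding: str = "utf-8") -> str:
    """Read only the bytes of the span [start, end) from the file at `path`."""
    # Binary mode ("rb") matters: in text mode, offsets would not be byte counts.
    with open(path, "rb") as f:
        f.seek(start)
        chunk = f.read(end - start)
    return chunk.decode(encoding, errors="replace")
```

With the readme's example, that would be something like `read_span("docs/temp_fbis/20.20.10-3414", 730, 740)`. If the spans were computed on files with different line endings than my copies (for example `\r\n` versus `\n`), the offsets would drift, so that is worth checking against a few known annotations.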