Match byte spans from an annotation into a text document, Python or Java


I'm using the MPQA opinion corpus, in which annotations and documents are saved in separate files. The annotation files contain character offsets (byte spans) into the documents, e.g.:
850,861

string  GATE_direct-subjective   
expression-intensity="medium"
attitude-link="a4"
nested-source="w, patient" 
intensity="medium" 
polarity="negative"

How can I match these byte spans into the text document? I'm grateful for any ideas! I prefer using Python but a solution in Java is also fine.


There is 1 answer

GrantD71 On

I'm not 100% sure I understand the question, but if you need a substring and you have its character positions, slicing the string is enough.

Python solution:

>>> sometext = "Grant D is a great guy."
>>> character_offset = [0, 7]
>>> # slice from the start offset up to (not including) the end offset
>>> subString = sometext[character_offset[0]:character_offset[1]]
>>> print(subString)
Grant D
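Since the question talks about byte spans rather than character positions, here is a minimal sketch of the same slicing done on raw bytes. It assumes the offsets count bytes, that the end offset is exclusive, and that the documents are UTF-8 encoded; none of that is confirmed by the question, so check it against the corpus documentation. The helper names `parse_span` and `extract_span` and the sample sentence are made up for illustration.

```python
def parse_span(span_text):
    """Turn a span such as '850,861' into a (start, end) pair of ints."""
    start, end = span_text.split(",")
    return int(start), int(end)


def extract_span(document_bytes, span_text, encoding="utf-8"):
    """Slice the raw bytes by offset, then decode the slice to text.

    Assumes the end offset is exclusive and the document uses `encoding`.
    """
    start, end = parse_span(span_text)
    return document_bytes[start:end].decode(encoding)


# Hypothetical document content. For a real file, read it in binary mode,
# e.g. open(path, "rb").read(), so that offsets refer to bytes.
document_text = "Café owners said the plan was a disaster."
document_bytes = document_text.encode("utf-8")

# "é" occupies two bytes in UTF-8, so byte offsets and character
# offsets differ by one for everything after it.
print(extract_span(document_bytes, "22,41"))
print(document_text[22:41])
```

If the document is pure ASCII, byte and character offsets coincide and plain string slicing gives the same result. If a span were to cut through the middle of a multi-byte character, the `decode` call would raise a `UnicodeDecodeError`, which is a useful hint that the offsets or the encoding assumption are wrong.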