I'm using the MPQA opinion corpus, in which annotations and documents are saved in separate files. The annotation files contain character offsets (byte spans) into the documents,
e.g. 850,861
string GATE_direct-subjective
expression-intensity="medium"
attitude-link="a4"
nested-source="w, patient"
intensity="medium"
polarity="negative"
How can I map these byte spans onto the text of the document? I'm grateful for any ideas! I'd prefer Python, but a solution in Java is also fine.
I'm not 100% sure I understand the question properly, but if you need a substring and you have character positions, the solution is simple.
Python solution:
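A minimal sketch, under a few assumptions: the span is a "start,end" pair like 850,861 with an exclusive end, the offsets count bytes (as the question says), and the document is UTF-8. The helper names `parse_span` and `extract_span` are made up for illustration. If the offsets turn out to count characters instead, slice the decoded string rather than the raw bytes.

```python
def parse_span(span):
    """Turn a span string such as '850,861' into a (start, end) tuple of ints."""
    start, end = (int(part) for part in span.split(","))
    return start, end


def extract_span(document_bytes, span, encoding="utf-8"):
    """Return the text covered by a byte span.

    Assumes the end offset is exclusive and the offsets count bytes,
    so the slice is taken on the raw bytes before decoding.
    """
    start, end = parse_span(span)
    return document_bytes[start:end].decode(encoding, errors="replace")


# In-memory stand-in for a document; a real file would be read with
# open(path, "rb").read() so that the offsets line up with the bytes on disk.
sample = "café is closed".encode("utf-8")
print(extract_span(sample, "9,15"))  # closed
```

For plain ASCII documents, byte and character offsets are the same, so `text[start:end]` on the decoded string gives the same result. They only drift apart once the text contains multi-byte characters, such as the "é" in the sample above.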