HYPERLINK "target"label
How can I extract hyperlinks from an HWPF document? I can get paragraphs from the .doc file and extract the correct styling if necessary (bold, italic, etc.), but how would I identify and extract hyperlinks from a paragraph?
The .doc format doesn't store hyperlinks in the simplest of ways, as you've noticed...
A hyperlink will be a single CharacterRun with special markers on it. Once you have detected it, split the text on the quotes: the part inside the quotes is the target, and whatever follows the closing quote is the label.
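As a rough sketch of the splitting step only (not the detection against a real document), something like the following could work. It assumes you already have the run's raw text as a String in the `HYPERLINK "target"label` shape shown above. The class and method names (`HyperlinkFieldParser`, `parse`, `Link`) are made up for illustration and are not part of POI or Tika. Stripping the 0x13/0x14/0x15 characters is also an assumption on my part about the field markers that may show up in the text, so check it against your own documents.

```java
import java.util.Optional;

// Hypothetical helper, not part of POI or Tika: parses the raw text of a
// hyperlink field, e.g.  HYPERLINK "target"label
public class HyperlinkFieldParser {

    /** The two pieces recovered from the field text. */
    public static final class Link {
        public final String target;
        public final String label;

        Link(String target, String label) {
            this.target = target;
            this.label = label;
        }
    }

    /** Returns the link if the text looks like a hyperlink field, else empty. */
    public static Optional<Link> parse(String runText) {
        if (runText == null) {
            return Optional.empty();
        }
        // Assumption: field begin/separator/end markers (0x13, 0x14, 0x15)
        // may be present in the text, so drop them before parsing.
        String text = runText.replaceAll("[\\u0013\\u0014\\u0015]", "").trim();
        if (!text.startsWith("HYPERLINK")) {
            return Optional.empty();
        }
        int open = text.indexOf('"');
        int close = text.indexOf('"', open + 1);
        if (open < 0 || close < 0) {
            return Optional.empty(); // no quoted target found
        }
        String target = text.substring(open + 1, close);
        String label = text.substring(close + 1).trim();
        return Optional.of(new Link(target, label));
    }

    public static void main(String[] args) {
        parse("HYPERLINK \"http://example.com/\"Example site")
            .ifPresent(l -> System.out.println(l.label + " -> " + l.target));
    }
}
```

You would call `parse` with the text of each CharacterRun in the paragraph and treat an empty result as "not a hyperlink". Returning an Optional keeps the non-hyperlink case explicit rather than relying on null checks.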
There's a good example of doing this in Apache Tika: look at the handleSpecialCharacterRuns method of WordExtractor to see how it's done.