I am trying to extract accented words from a PDF e-book. The best results so far come from the iText library, but I fail to get the accents on the words. Example:
побеђивање -should come out as- побеђи́ва̄ње (accents are missing)
The letters are Serbian Cyrillic. I tried many OCR solutions, but they all give bad results. Is there a way to extract all of this PDF data exactly as it appears in the PDF using iText? I know this has a lot to do with the way PDF works and that it is hard to get right, but I really need this; the alternative is to retype all of the data. The PDF file: pdf example file
The sample document actually contains one big image, a scanned page, and invisible text information on top of the scanned printed letters. Most likely this text information is the result of some OCR process.
Unfortunately, this text information is already missing the accents in question. E.g. the text for the first entry
is added as
As you can see, the same letter \340 is used at positions 1 and 4, while according to the scanned page one of the matching printed letters has an accent and the other does not. This happens throughout the whole page.
Thus, any attempt at regular text extraction will fail to return the accents in question. The only chance you have is to use OCR.
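If you want to check quickly whether some extraction or OCR output carries the accents at all, you can count Unicode combining marks in the resulting string. The following is only a minimal sketch using plain Java (no iText involved); the class name AccentCheck and the method countCombiningMarks are made up for illustration, and it assumes the accents arrive as separate combining characters such as U+0301 (acute) and U+0304 (macron), as they do in the expected word from the question. Precomposed accented letters would not be counted.

```java
public class AccentCheck {

    // Counts the code points in the text that are non-spacing marks
    // (Unicode category Mn), e.g. U+0301 COMBINING ACUTE ACCENT or
    // U+0304 COMBINING MACRON. The text is inspected as-is, without
    // normalization.
    static long countCombiningMarks(String text) {
        return text.codePoints()
                .filter(cp -> Character.getType(cp) == Character.NON_SPACING_MARK)
                .count();
    }

    public static void main(String[] args) {
        // The word from the question, written with escapes so the source
        // file encoding does not matter.
        String plain = "\u043F\u043E\u0431\u0435\u0452\u0438\u0432\u0430\u045A\u0435";
        // Same word with an acute after the first U+0438 and a macron
        // after U+0430, as it should come out.
        String accented = "\u043F\u043E\u0431\u0435\u0452\u0438\u0301\u0432\u0430\u0304\u045A\u0435";

        System.out.println("plain:    " + countCombiningMarks(plain));
        System.out.println("accented: " + countCombiningMarks(accented));
    }
}
```

A count of zero over a whole page of extracted text is a strong hint that the accents were never stored in the text layer in the first place.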
You say you
Probably you applied the OCR applications to the PDF or a rendered version of it. I would suggest you instead extract the scanned images; this way you get all the quality there is. iText can help you with image extraction.