iText: extracting accented letters from a PDF


I am trying to extract accented words from a PDF e-book. The iText library gives the best results so far, but the accents on the words are lost. Example:

побеђивање -should come out as- побеђи́ва̄ње (accents are missing)

The letters are Serbian Cyrillic. I tried many OCR solutions, but they all give bad results. Is there a way to extract the text with iText exactly as it appears in the PDF? I know this has a lot to do with how PDF works and that it is hard to achieve, but I really need it; the alternative is to retype all of the data. The PDF file: pdf example file

Answer by mkl (best answer)

The sample document actually contains one big image, a scanned page, and invisible text information on top of the scanned printed letters. Most likely this text information is the result of some OCR process.

Unfortunately, this text information itself already lacks the accents in question. For example, the text for the first entry

асталчнћ м дем. од астал.

is added as

(\340\361\362\340\353\367\355)Tj 0 Tc (\236)Tj
...

As you can see, the same character code \340 is used at positions 1 and 4, whereas on the scanned page one of the corresponding printed letters carries an accent and the other does not.
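To make this concrete, here is a minimal sketch that decodes the octal escapes from the string operand above. It assumes the font maps byte codes the way Windows-1251 does; that fits the text quoted above, but it is an assumption, not something read from the PDF's font dictionary. The class and method names (PdfLiteralDecoder, toBytes, decode) are made up for this illustration and are not part of iText.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class PdfLiteralDecoder {
    // Assumption: the font's encoding behaves like Windows-1251.
    static final Charset ASSUMED_ENCODING = Charset.forName("windows-1251");

    // Turns the body of a PDF literal string into raw bytes.
    // Only \ddd octal escapes are handled; other escapes are left as-is.
    static byte[] toBytes(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c == '\\') {
                int j = i + 1;
                int value = 0;
                // Read up to three octal digits after the backslash.
                while (j < literal.length() && j < i + 4
                        && literal.charAt(j) >= '0' && literal.charAt(j) <= '7') {
                    value = value * 8 + (literal.charAt(j) - '0');
                    j++;
                }
                if (j > i + 1) {
                    out.write(value & 0xFF);
                    i = j;
                    continue;
                }
            }
            out.write(c);
            i++;
        }
        return out.toByteArray();
    }

    // Decodes the literal using the assumed single-byte encoding.
    static String decode(String literal) {
        return new String(toBytes(literal), ASSUMED_ENCODING);
    }

    public static void main(String[] args) {
        String first = "\\340\\361\\362\\340\\353\\367\\355";
        byte[] bytes = toBytes(first);
        System.out.println(decode(first) + decode("\\236"));
        // Positions 1 and 4 hold the identical byte, so no accent can be recovered.
        System.out.println("same code at positions 1 and 4: " + (bytes[0] == bytes[3]));
    }
}
```

Whatever the actual encoding is, two identical byte codes can only ever decode to the same character, so the accent cannot be recovered from this text layer.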

This happens throughout the whole page.

Thus, any attempt at regular text extraction will fail to return the accents in question. The only chance you have is to use OCR.

You say you

tried many of the ocr solutions but they all give bad results

You probably applied the OCR applications to the PDF itself or to a rendered version of it. I would suggest that you instead extract the scanned images and run OCR on those; this way you get all the quality there is. iText can help you with image extraction.