Correctly extracting text from a PDF (UTF-8)


I want to extract text from some PDF files (programmatically, with a utility, or even by copy/paste), but some characters come out wrong. Although I specify UTF-8 encoding when extracting the text, characters such as "ș", "ț", "ă" come out as "„ ˛" instead of the displayed characters (or at least plain "s", "t", "a"). The text is displayed correctly, but when I copy it, for example, those characters are wrong.
Is there a way (a Java/C/Python library, or a Windows/Linux utility) to extract the text correctly, or are these PDF files corrupted in some way?


There is 1 answer

mark stephens (best answer)

Can you extract the text correctly in Acrobat from the PDF?
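
Whatever tool is used, it can help to look at which code points actually come out. Below is a minimal Python sketch using only the standard library; `describe_chars` is a hypothetical helper written for this illustration, not part of any PDF library, and the sample string is the garbled output quoted in the question. If the extracted characters turn out to be valid but unrelated code points, the cause is likely the character mapping stored in the PDF rather than the UTF-8 setting, though that is an assumption to verify against the actual files.

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    rows = []
    for ch in text:
        if ord(ch) > 127:
            name = unicodedata.name(ch, "UNKNOWN")
            rows.append((ch, f"U+{ord(ch):04X}", name))
    return rows


# What the page displays versus what copy/paste or extraction returned
# (both strings are taken from the question).
expected = "șță"
extracted = "„˛"

for label, sample in (("expected", expected), ("extracted", extracted)):
    print(label)
    for ch, code_point, name in describe_chars(sample):
        print(f"  {ch}  {code_point}  {name}")
```

Running the same check on text copied out of Acrobat and on text produced by another extractor shows whether both return the same wrong code points.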