Correctly extracting the text from a PDF (UTF-8)

Question

1.7k views Asked by Andrei F At 18 May 2012 at 08:51

I want to extract text from some PDF files (programmatically, with a utility, or even by copy and paste), but some characters come out wrong. Although I specify UTF-8 encoding when extracting the text, characters such as "ș, ț, ă" come out as "„ ˛" instead of the displayed characters, or at least their plain equivalents "s, t, a". The text is displayed correctly in the viewer, but when I copy it, for example, those characters are wrong.
Is there a way to extract the text correctly (a Java/C/Python library, or a Windows/Linux utility), or are these PDF files corrupted in some way?

Original Q&A

There is 1 answer

**mark stephens** · Accepted Answer · 2012-05-18T10:08:10+00:00

Can you extract the text correctly in Acrobat from the PDF?
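
A related check, offered only as a minimal sketch and not as part of the original answer: printing the Unicode code points of the extracted text can help tell a decoding problem apart from a problem inside the PDF itself. If the output consists of valid but unrelated characters, a likely (though not certain) cause is that the font's character mapping in the PDF is missing or wrong, in which case specifying UTF-8 on output would not help. The helper name `describe_chars` and the sample strings are hypothetical; the samples are taken from the characters quoted in the question.

```python
import unicodedata

def describe_chars(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    rows = []
    for ch in text:
        if ord(ch) > 127:
            rows.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return rows

# Hypothetical samples: what the extractor returned vs. what the page shows.
extracted = "\u201e \u02db"        # the odd characters quoted in the question
expected = "\u0219 \u021b \u0103"  # s-comma, t-comma, a-breve

for label, sample in (("extracted", extracted), ("expected", expected)):
    print(label)
    for ch, code, name in describe_chars(sample):
        print(f"  {ch!r} {code} {name}")
```

If the "extracted" rows show real punctuation or accent characters instead of the expected letters, the bytes were decoded consistently and the wrong characters most likely come from the mapping stored in the PDF, which is also what copying from Acrobat would reveal.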
