PDF to text messes up Latin accents


I have a few PDFs written in Brazilian Portuguese which I'd like to parse and process. I tried the PDFBox text extraction command-line tool (with no arguments at all), but I get the following results:

Cão 

ends up as

C~
ao

Copying and pasting the text, or exporting it as text from Adobe Reader, gives the same result. With other files, the same three methods (PDFBox, copy and paste, Adobe Reader export) extract the text as expected ("Cão"), so, not being a PDF expert, I figure it has to do with the way the files were created. Has anyone seen this behavior, and how can I work around it when extracting the text?

There is 1 answer

Grasshopper (BEST ANSWER)

So thanks to Stack Overflow I managed to find the post below:

How to get text extraction from PDF to work?

which gave me the information I was looking for. Apparently the PDFs are being generated without the information needed to interpret the accented Latin characters, so the accent and the base letter come out as separate pieces of text.
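If regenerating the PDFs is not an option, one possible workaround is to clean up the extracted text afterwards. The snippet below is only a rough sketch of that idea, not something taken from the linked post: it assumes the damage always looks like the "C~ / ao" case above (a standalone accent, optional whitespace or a line break, then the letter it belongs to). The function name `repair_accents` and the accent table are made up for illustration, and the heuristic will also rewrite a legitimate `~` or `^` that happens to precede a letter.

```python
import re
import unicodedata

# Standalone (spacing) accent -> Unicode combining mark.
# Assumed set for Portuguese text; extend it if other marks show up.
COMBINING = {
    "~": "\u0303",       # tilde
    "\u00b4": "\u0301",  # acute accent
    "`": "\u0300",       # grave accent
    "^": "\u0302",       # circumflex
    "\u00a8": "\u0308",  # diaeresis
}

# An accent, optional whitespace (including a line break), then an ASCII letter.
_ACCENT_THEN_LETTER = re.compile(r"([~\u00b4`^\u00a8])\s*([A-Za-z])")


def repair_accents(text: str) -> str:
    """Merge a detached accent into the letter that follows it."""

    def merge(match: re.Match) -> str:
        accent, letter = match.groups()
        # NFC composes letter + combining mark into one character when one exists.
        return unicodedata.normalize("NFC", letter + COMBINING[accent])

    return _ACCENT_THEN_LETTER.sub(merge, text)


if __name__ == "__main__":
    print(repair_accents("C~\nao"))
```

This would be run on the text file produced by PDFBox (or any other extractor), so it does not depend on a particular extraction tool.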