Tika returns garbled text from PDF file


I'm converting PDF files to plain text with Tika. Most of the PDFs work fine, but one of them returns only garbled text, like this:

� . LEZI E TVSZIR XVEGO VIGSVH SJ PIEHMRK ERH QIR�
XSVMRK XIEQW SJ WM\ QIQFIVW [MXL ZEV]MRK TVSǻGMIRG] PIZIPW� 2] I\TIVMIRGI MR STXMQM^MRK [IF�FEWIH TVSHYGXW

I use the following code:

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(fileData, handler, metadata, new ParseContext());

What could be the reason for that, and how can it be fixed? I can open this PDF file in an external viewer with no issues.


There are 2 answers

0
K J

Avoid OCR; it will not give as good a result.

The text is roughly:
I have a proven track record of leading and men�
That is an odd ending, perhaps a hyphen, as the next line starts with "toring" (men-toring, not men touring).
...teams of six members with varying proficiency levels. My experience in optimizing web-based products....

We have to presume it is a cut-and-paste CV (and thus not offered for analysis). I would not give them the job: they do not QA their own work. Simply put it in the nearest recycle bin. What they need is an expert in anonymous skills-based AI platforms that do not use bots.

You can easily convert the text with a program, even a command-line batch file. The necessary maths is basic, but there are a few gotchas. For example ǻ = fi, so one extracted character stands for a pair of letters (a "ligature"). The rest are fairly easy to pick out.

Once you have sorted out the bulk of the text, which is lowercase, the capitals should be easy to resolve.

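Here is my "Enigma machine"; actually it is only a schoolboy's Caesar shift. In each pair of rows the top row is the real letter and the bottom row is the character that extraction returns. I will let you fill in the gaps via char() addition and subtraction.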

abcdefghijklmnopqrstuvwxyz 
EFGHIJKLMNOPQRSTUVWXYZ[\]^

ABCDEFGHIJKLMNOPQRSTUVWXYZ
&'()    ./0123456789:;       
Note that each grouping may need to slide right or left, or be split, by a character or two.

fi - .
 ǻ ��
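
The table above can be turned into a few lines of Java. This is a minimal sketch, not a general fix: the class name GlyphShiftDecoder is made up for illustration, and the two offsets are read off the table and checked against the sample in the question, so they are an assumption for any other PDF. The uppercase range is extrapolated from the entries the table does show. The hyphen and the full stop both come out of extraction as the same replacement character, so they cannot be told apart and are passed through unchanged.

```java
final class GlyphShiftDecoder {
    // Offsets taken from the table above; specific to this one PDF (assumption).
    private static final int LOWER_SHIFT = 28; // extracted 'E' (69) -> real 'a' (97)
    private static final int UPPER_SHIFT = 27; // extracted '&' (38) -> real 'A' (65)

    static String decode(String garbled) {
        StringBuilder out = new StringBuilder(garbled.length());
        for (int i = 0; i < garbled.length(); i++) {
            char c = garbled.charAt(i);
            if (c == '\u01FB') {
                out.append("fi");                       // ǻ stands for the "fi" ligature
            } else if (c >= 'E' && c <= '^') {
                out.append((char) (c + LOWER_SHIFT));   // extracted E..^ -> real a..z
            } else if (c >= '&' && c <= '?') {
                out.append((char) (c + UPPER_SHIFT));   // extracted &..? -> real A..Z (extrapolated)
            } else {
                out.append(c);                          // spaces, U+FFFD and anything unmapped
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Second line of the sample from the question (backslash escaped for Java).
        System.out.println(decode("XSVMRK XIEQW SJ WM\\ QIQFIVW [MXL ZEV]MRK TVS\u01FBGMIRG] PIZIPW"));
    }
}
```

Run it on the text Tika returns, not on the PDF itself; the remaining replacement characters then have to be resolved by eye as either a hyphen or a full stop.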

If you attempt to do substitution in the source PDF, it will often produce odd outputs.

Self unlOaderS haVe all been Carried Out Within the grOuP.  
Our CaPabilitieS haVe been demOnStrated time and again in  the  SafeT  
 timelyT  On budget deliVery Of the many COmPleX and Often ObSCure  
 COnVerSiOnS that We haVe undertaken. Our CaPabilitieS are reinfOrCed by  
the number Of rePeat CuStOmerS fOr bOth COnVerSiOn and rePair PrOjeCtS.  

Thus the substitution is best done on the extracted plain text, where adjustments are easier.

The Japanese have a word for fully garbled text: mojibake.

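You can easily convert it using a program, but can you also use a command-line batch file? The necessary maths is basic, but there are a few problems, for example:

ǻ = fi, so one character = a pair, as a "ligature",
ǻ = fi, so one letter = a pair, as a "ligature",

but the rest can be picked out easily. Once you have sorted out most of the lowercase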
3
iPDFdev

Text display and text extraction in PDF files use different PDF features, which is why one can work while the other doesn't.

The text in your file uses a font that either 1. uses a custom encoding with no ToUnicode CMap or 2. uses a specially crafted ToUnicode CMap to fool text extractors. In both situations the file has to be fixed manually.
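
Either cause shows up only in the extracted string, so in a batch job one rough way to catch such files is to sanity-check the extracted text before trusting it. The sketch below is a heuristic and not part of Tika: the class name ExtractionSanityCheck, the U+FFFD check and both thresholds are assumptions, chosen for mostly lowercase English prose like the sample in the question. It will misfire on documents that are legitimately all caps.

```java
final class ExtractionSanityCheck {
    private static final int MIN_LETTERS = 20;          // arbitrary: too little text to judge below this
    private static final double MIN_LOWER_RATIO = 0.5;  // arbitrary: normal prose is mostly lowercase

    static boolean looksGarbled(String text) {
        int letters = 0;
        int lowercase = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\uFFFD') {
                return true;                            // replacement character in the output
            }
            if (Character.isLetter(c)) {
                letters++;
                if (Character.isLowerCase(c)) {
                    lowercase++;
                }
            }
        }
        if (letters < MIN_LETTERS) {
            return false;                               // not enough evidence either way
        }
        return (double) lowercase / letters < MIN_LOWER_RATIO;
    }
}
```

A file flagged this way can then be set aside for manual inspection of its fonts instead of being indexed as-is.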

If you can post a link to the PDF file for inspection, I can tell you the exact problem.