I'm parsing PDF files to TXT. Most of the PDFs work fine, but one of them returns only garbled text, like this:
� . LEZI E TVSZIR XVEGO VIGSVH SJ PIEHMRK ERH QIR�
XSVMRK XIEQW SJ WM\ QIQFIVW [MXL ZEV]MRK TVSǻGMIRG] PIZIPW� 2] I\TIVMIRGI MR STXMQM^MRK [IF�FEWIH TVSHYGXW
I use the following code:
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(fileData, handler, metadata, new ParseContext());
What could be the reason for that, and how can it be fixed? I can open this PDF file without any issues in an external viewer.
Avoid OCR; it will not give as good a translation.
The text is roughly:
I have a proven track record of leading and men�
That is an odd ending, perhaps a hyphen, as the next line starts with "toring" (men-toring, not men touring).
...teams of six members with varying proficiency levels. My experience in optimizing web-based products....
We have to presume it's a cut-and-paste CV (thus not offered for analysis). I would not give them the job; they do not QA their own work. Simply put it in the nearest recycle bin. What they need is an expert in anonymous skills-based AI platforms that do not use bots.
You can easily convert it using a program, even a command-line batch file. The necessary maths is basic, but there are a few gotchas, such as ǻ = fi, where one letter stands for a pair (a "ligature"). The rest are fairly easy to pick out. Once you sort out the bulk in lowercase, the caps should be easy to resolve.
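To illustrate how basic the maths is, here is a minimal sketch that subtracts character codes for a few garbled/readable pairs. The pairs are taken from the sample in the question and the reading given above ("LEZI" as "have", the leading "." and "2" as "I" and "M"); the class and method names are hypothetical.

```java
final class OffsetProbe {
    // Difference between the readable character and its garbled counterpart.
    static int offset(char garbled, char plain) {
        return plain - garbled;
    }

    public static void main(String[] args) {
        // Pairs inferred from the sample text; they are assumptions, not facts about the PDF.
        char[][] pairs = { {'L', 'h'}, {'E', 'a'}, {'Z', 'v'}, {'.', 'I'}, {'2', 'M'} };
        for (char[] p : pairs) {
            System.out.println("'" + p[0] + "' -> '" + p[1] + "' : +" + offset(p[0], p[1]));
        }
    }
}
```

For these pairs the lowercase letters come out at +28 and the two capitals at +27, so the lowercase and uppercase ranges appear to need slightly different shifts.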
Here is my "Enigma Machine"; actually it's a schoolboy's Caesar shift (a simple substitution cipher). I will let you fill in the gaps via char() addition and subtraction.
If you attempt to do the substitution in the source PDF, it will often produce odd outputs. It is therefore best done on the extracted plain text, where adjustments are easier.
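As a rough sketch of that char() addition applied to the already extracted text, the Java below shifts the garbled characters back and expands the ligature. The offsets (+28 for the range that holds lowercase letters, +27 for the range that appears to hold capitals) are inferred only from the short sample in the question, and the capital offset rests on just two characters, so treat them as assumptions to verify against more of the document. The class name `ShiftDecoder` is hypothetical. Characters outside the two ranges, including the � replacement character, are left unchanged.

```java
final class ShiftDecoder {
    // Offsets inferred from the sample in the question; assumptions, not facts about the PDF.
    static final int LOWER_OFFSET = 28; // garbled 'E'..'^' -> 'a'..'z'
    static final int UPPER_OFFSET = 27; // garbled '&'..'?' -> 'A'..'Z'

    static String decode(String garbled) {
        StringBuilder out = new StringBuilder(garbled.length());
        for (int i = 0; i < garbled.length(); i++) {
            char c = garbled.charAt(i);
            if (c == '\u01FB') {
                // One garbled letter stands for the two-letter "fi" ligature.
                out.append("fi");
            } else if (c >= 'E' && c <= '^') {
                out.append((char) (c + LOWER_OFFSET));
            } else if (c >= '&' && c <= '?') {
                out.append((char) (c + UPPER_OFFSET));
            } else {
                // Spaces and unmapped characters pass through untouched.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: have a proven track record of leading and men
        System.out.println(decode("LEZI E TVSZIR XVEGO VIGSVH SJ PIEHMRK ERH QIR"));
    }
}
```

In the question's code, the input would be the string returned by `handler.toString()` after `parser.parse(...)`. Digits and punctuation are not covered by the sample, so their mapping would still have to be worked out by hand.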
The Japanese have a word for garbled text: mojibake.