How to decode a PDF file and encode it back?


My overall goal is to make some PDF files conform to the PDF/A standard for archival purposes. They fail one requirement, namely that some glyph mappings map to 0, which they should not.

My usual strategy was to use an old program called "Pdfedit", which could decode PDF files so that all the byte streams became human-readable. I would edit the relevant part of the PDF containing the glyph mappings and then open the file with Adobe Acrobat, which automatically re-encoded it.

Now I have some PDFs that cause "Pdfedit" to crash upon opening. I tried using PDF-Parser but its output cannot be re-encoded by Adobe Acrobat.

When decoded, the relevant parts used to look like this:

/CMapType 2 def
 1 begincodespacerange
 <00><04>
 endcodespacerange
 5 beginbfchar
 <00><0000>
 <01><0000>
 <02><263A>
 <03><0000>
 <04><0000>
 endbfchar
 endcmap
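
For reference, the entries that break the PDF/A requirement in a decoded block like the one above can be located programmatically. The following is only a minimal sketch: the function name `find_zero_mappings` and the regular expressions are illustrative assumptions, it handles `bfchar` sections only (not `bfrange`), and it does not decide what the zero targets should be replaced with.

```python
import re

# Matches one bfchar entry such as "<02><263A>": source code, then target, both in hex.
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def find_zero_mappings(cmap_text):
    """Return the source codes (hex strings) whose bfchar target is all zeros.

    Hypothetical helper for illustration; not part of Pdfedit, pdf-parser or qpdf.
    """
    zero_codes = []
    # Only look between "beginbfchar" and "endbfchar" so that the
    # codespacerange line (<00><04>) is not mistaken for a mapping.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.DOTALL):
        for src, dst in BFCHAR_ENTRY.findall(block):
            if int(dst, 16) == 0:
                zero_codes.append(src)
    return zero_codes


sample = """\
/CMapType 2 def
1 begincodespacerange
<00><04>
endcodespacerange
5 beginbfchar
<00><0000>
<01><0000>
<02><263A>
<03><0000>
<04><0000>
endbfchar
endcmap
"""

print(find_zero_mappings(sample))  # ['00', '01', '03', '04']
```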

But now I use the following command:

python3 pdf-parser.py -f -n /path/to/file.pdf > dump.txt

Inside dump.txt the relevant part looks like this:

b'/CMapType 2 def\n1 begincodespacerange\n<00><04>\nendcodespacerange\n5 beginbfchar\n<00><0000>\n<01><0000>\n<02><263A>\n<03><0000>\n<04><0000>\nendbfchar\nendcmap\nCMapName currentdict/CMap defineresource pop end end'

So the stream is printed as a Python bytes literal, and every line break is rendered literally as \n. Adobe Acrobat cannot interpret the text file containing this as a PDF.
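
To illustrate what the b'...' notation means: it looks like the printed form of a Python bytes object, so a single dumped line can be turned back into real text with the standard library. This is a sketch under that assumption. The helper name `bytes_repr_to_text` and the choice of latin-1 are mine, and the result is only the content of one stream, not a valid PDF.

```python
import ast


def bytes_repr_to_text(dump_line, encoding="latin-1"):
    """Turn a line like b'...\\n...' (the printed form of bytes) back into text.

    Hypothetical helper for illustration. latin-1 is chosen because it maps
    every byte value to a character, so decoding never fails.
    """
    # literal_eval only parses Python literals; it does not execute code.
    value = ast.literal_eval(dump_line.strip())
    if not isinstance(value, bytes):
        raise ValueError("expected the printed form of a bytes object")
    return value.decode(encoding)


# A shortened version of the dumped line from the question.
line = r"b'5 beginbfchar\n<02><263A>\nendbfchar\nendcmap'"
print(bytes_repr_to_text(line))
```

The printed text has real line breaks again, which makes the stream readable, but it still lacks the object structure and cross-reference table that a PDF reader needs.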

I have now also realized that many elements in the dump, such as %%EOF, are wrapped in quotes ('').

The true issue is how to get Acrobat-readable output from pdf-parser.py: redirecting with the shell operator > does not work, and the output printed to stdout in the shell has the same problem.

I will try out a few things but could really use some help on this!


Answer by Smogshaik:

Answering my own question in case this is relevant for someone down the line.

Didier Stevens, the developer behind pdf-parser, answered that his tool is not made for this. He recommended qpdf instead.

That was indeed the solution. Make sure you use the flag --stream-data=uncompress so that compressed parts are also accessible in the output. The command to use with qpdf is:

qpdf old_file.pdf --stream-data=uncompress --decode-level=all new_file.txt

You can also give the output file a .pdf extension. In either case you will be able to open it in a text editor. Once you have made your changes, change the file extension to .pdf (if it is not already) and process the file further with Acrobat or any other conversion program.
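
If several files need the same treatment, the command above can be wrapped in a small script. This is a rough sketch, assuming qpdf is installed and on the PATH. The function names are hypothetical, and the only qpdf options used are the two given in this answer.

```python
import subprocess


def build_qpdf_command(src, dst):
    """Build the argument list for the qpdf call shown in the answer."""
    return ["qpdf", src, "--stream-data=uncompress", "--decode-level=all", dst]


def uncompress_pdf(src, dst):
    """Run qpdf and return its exit code (hypothetical convenience wrapper).

    The exit code is returned rather than raised on, so the caller can
    decide how to treat a nonzero status.
    """
    completed = subprocess.run(build_qpdf_command(src, dst))
    return completed.returncode


# Example call (commented out so the sketch has no side effects when imported):
# uncompress_pdf("old_file.pdf", "new_file.txt")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.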