Extract and recompile PDF file using Origami


This is regarding Origami, the Ruby tool for exploring PDF files at http://esec-lab.sogeti.com/pages/Origami

As an example, I am trying to open a PDF file, extract it to a Ruby script with pdf2ruby, and then regenerate the original PDF from that script. This is the complete set of commands I am using:

hg clone https://code.google.com/p/origami-pdf/
cd origami-pdf/
rake
cd ..
curl 'http://www.ada.gov/hospcombrprt.pdf' -o hospcombrprt.pdf
origami-pdf/bin/pdf2ruby -x hospcombrprt.pdf
mv hospcombrprt.pdf hospcombrprtORIG.pdf
cd hospcombrprt
ruby hospcombrprt.rb # THIS STEP PRODUCES ERRORS
cmp hospcombrprt.pdf ../hospcombrprtORIG.pdf || echo FAILED

However, this produces the following error:

/Users/williamentriken/Developer/origami-pdf/lib/origami/page.rb:75:in `pages': Invalid page tree (Origami::InvalidPDFError)
    from /Users/williamentriken/Developer/origami-pdf/lib/origami/pdf.rb:689:in `compile'
    from /Users/williamentriken/Developer/origami-pdf/lib/origami/pdf.rb:233:in `save'
    from hospcombrprt.rb:189:in `<main>'

Has anyone successfully performed this operation with this library? If so, could you please share how?


There is 1 answer

Answer by Myst

Original Post:

I played around with the library for a while, but I kept getting errors and minor bugs, such as replicated pages and missing pages...

...you should read the author's comment about the limits of using the Origami library.

I recommend the combine_pdf gem; it's great for simple PDF manipulations such as merging, stamping, and the like.

Update:

I looked at the specific PDF file, and the problem might be related to an unsupported PDF version.

The file at http://www.ada.gov/hospcombrprt.pdf is encrypted with type 4 encryption (a V value of 4 in the encryption dictionary), which the PDF standard describes, starting with PDF 1.5, as:

"(PDF 1.5) The security handler defines the use of encryption and decryption in the document, using the rules specified by the CF, StmF, and StrF entries."
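One rough way to check whether a given file uses these entries is to scan its raw bytes for the relevant name tokens. The following is a minimal sketch, not how Origami does it: the helper name `pdf_encryption_summary` is made up for illustration, and the scan is a heuristic rather than a real PDF parser (it assumes the encryption dictionary is stored uncompressed, and it only reads the header version, which a document catalog may override).

```ruby
# Heuristic scan of raw PDF bytes for the header version and encryption markers.
# NOTE: pdf_encryption_summary is a hypothetical helper, not part of Origami
# or any other library. It does not parse the PDF; it only looks for tokens.
def pdf_encryption_summary(bytes)
  # Treat the input as binary so arbitrary stream bytes cannot break matching.
  data = bytes.dup.force_encoding(Encoding::BINARY)
  {
    # Version from the "%PDF-x.y" header at the very start of the file.
    version:   data[/\A%PDF-(\d\.\d)/, 1],
    # A trailer that references an encryption dictionary contains /Encrypt.
    encrypted: data.match?(%r{/Encrypt\b}),
    # Crypt filter names chosen for streams and strings (StmF, StrF entries).
    stm_f:     data[%r{/StmF\s*/(\w+)}, 1],
    str_f:     data[%r{/StrF\s*/(\w+)}, 1],
    # Crypt filter method inside the CF dictionary, e.g. V2 (RC4) or AESV2.
    cfm:       data[%r{/CFM\s*/(\w+)}, 1],
  }
end

# Tiny hand-written sample that mimics the relevant parts of an encrypted file.
sample = "%PDF-1.6\n" \
         "1 0 obj\n" \
         "<< /Filter /Standard /V 4 " \
         "/CF << /StdCF << /CFM /AESV2 >> >> " \
         "/StmF /StdCF /StrF /StdCF >>\n" \
         "endobj\n" \
         "trailer\n<< /Encrypt 1 0 R >>\n"

p pdf_encryption_summary(sample)
```

For a real file, the same helper could be called as `pdf_encryption_summary(File.binread("hospcombrprt.pdf"))`. A result with `encrypted: true` and a non-nil `cfm` would suggest the crypt filter mechanism quoted above is in use.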

The encryption uses the AESV2 crypt filter method, which is available only in PDF 1.6 and above:

"AESV2 (PDF 1.6) The application shall ask the security handler for the encryption key and shall implicitly decrypt data with "Algorithm 1: Encryption of data using the RC4 or AES algorithms", using the AES algorithm in Cipher Block Chaining (CBC) mode with a 16-byte block size and an initialization vector that shall be randomly generated and placed as the first 16 bytes in the stream or string."
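The data layout described in that passage (a 16-byte initialization vector followed by AES-CBC ciphertext) can be illustrated with Ruby's standard OpenSSL bindings. This is only a sketch under stated assumptions: the method names `aesv2_decrypt` and `aesv2_encrypt` are made up for illustration, and `object_key` is assumed to be the 16-byte key that the security handler has already derived for the object. Key derivation itself is not shown.

```ruby
require "openssl"

# Hypothetical helper: decrypt one AESV2 string or stream body.
# Assumes object_key is the already-derived 16-byte AES key for this object.
def aesv2_decrypt(object_key, data)
  # Need the 16-byte IV plus at least one 16-byte ciphertext block.
  raise ArgumentError, "data shorter than IV plus one block" if data.bytesize < 32

  iv         = data.byteslice(0, 16)                   # first 16 bytes are the IV
  ciphertext = data.byteslice(16, data.bytesize - 16)  # the rest is ciphertext

  cipher = OpenSSL::Cipher.new("aes-128-cbc")
  cipher.decrypt
  cipher.key = object_key
  cipher.iv  = iv
  cipher.update(ciphertext) + cipher.final             # final strips the padding
end

# Hypothetical counterpart, used here only to build test data with that layout.
def aesv2_encrypt(object_key, plaintext)
  cipher = OpenSSL::Cipher.new("aes-128-cbc")
  cipher.encrypt
  cipher.key = object_key
  iv = cipher.random_iv                                # random IV, also set on cipher
  iv + cipher.update(plaintext) + cipher.final         # IV goes first in the output
end

key  = OpenSSL::Random.random_bytes(16)  # stand-in for a real derived object key
blob = aesv2_encrypt(key, "hello PDF")
puts aesv2_decrypt(key, blob)
```

The round trip only demonstrates where the IV lives and how the cipher is configured. Decrypting a real document would additionally require computing the file key from the password and the per-object key from the object and generation numbers, as the PDF standard specifies.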

So, even if the decryption code has been written, how to apply it might not be known because of the way the PDF file is structured...

...It might be better to start with simple PDF files and then patch anything that isn't supported just yet.