PDF Parsing -- Extract single page


I recently wrote a program in Python that allowed me to read in a PDF, take some commands from the user, and output part or all of the original PDF with pages in different orders. You could also select which pages you were interested in. At the time, there was a great library for it, PyPDF2. It did all of the heavy lifting.

Now, I am working in another language (Haskell) that has basically no PDF support that I can find. I'm considering making my own personal library. However, when looking at the contents of a PDF file, I'm finding it hard to determine where specific pages are. I can tell how many pages total there are in the file, but I can't look at a specific part of the file and say, "This is page X of Y." So, how do I separate out content based on pages? How could I split the file based on pages, if I don't know what page content is on?


There is 1 answer

David van Driessche (best answer)

The first thing you need is a copy of the PDF specification. You can download this for free from the Adobe web site here: http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

In that document, look at section 7.7.3 which explains how the "Page Tree" works.

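Basically, a PDF file contains a tree (Adobe suggests it should be a balanced tree, but you're under no obligation to keep to that). The tree starts at a root "Pages" object, which the document catalog points to through its Pages entry. It optionally passes through any number of intermediate "Pages" objects and ends in "Page" objects. Each "Pages" object lists its children, in order, in its Kids array. For example: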

Pages
. Pages
  . Page (1)
  . Page (2)
  . Page (3)
. Pages
  . Pages
    . Page (4)
    . Page (5)
  . Pages
    . Page (6)
    . Page (7)

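The number of levels in this tree is not limited. To find a given page, you have to walk the tree depth-first, visiting the children of each "Pages" object in order and assigning page numbers as you reach leaf "Page" objects. The example above shows the page number each leaf would receive (counting from 1). Each "Pages" object also has a Count entry giving the number of leaf pages beneath it, so you can skip a whole subtree when the page you want is not inside it.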
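The walk described above can be sketched in a few lines. This is only an illustration, written in Python because that is what the question started from; the recursion should carry over to Haskell almost unchanged. It assumes the page tree has already been parsed into nested dictionaries that mirror the Type, Kids and Count entries. The helper names (page, pages, iter_pages, find_page) and the Label field are made up for the example and are not part of the PDF format or of any library.

```python
# Hypothetical in-memory model of a page tree. A real parser would first
# have to resolve indirect object references to get these dictionaries.

def page(label):
    """Build a leaf node. "Label" is an illustrative field, not a PDF key."""
    return {"Type": "Page", "Label": label}

def leaf_count(node):
    """Number of leaf pages under a node (1 for a leaf itself)."""
    return node["Count"] if node["Type"] == "Pages" else 1

def pages(*kids):
    """Build an intermediate node whose Count is the number of leaves below it."""
    return {"Type": "Pages", "Kids": list(kids),
            "Count": sum(leaf_count(k) for k in kids)}

def iter_pages(node):
    """Yield leaf Page nodes in document order (depth-first, Kids in order)."""
    if node["Type"] == "Page":
        yield node
    else:
        for kid in node["Kids"]:
            yield from iter_pages(kid)

def find_page(node, number):
    """Return the leaf for a 1-based page number, skipping subtrees via Count."""
    if number < 1 or number > leaf_count(node):
        raise IndexError("page number out of range")
    if node["Type"] == "Page":
        return node
    for kid in node["Kids"]:
        size = leaf_count(kid)
        if number <= size:
            return find_page(kid, number)
        number -= size  # the wanted page is not in this subtree, skip it
    raise IndexError("Count does not match the actual tree")

# The same shape as the example tree above, with letters standing in for pages.
tree = pages(
    pages(page("a"), page("b"), page("c")),
    pages(pages(page("d"), page("e")), pages(page("f"), page("g"))),
)
```

Here find_page(tree, 5) descends into the second top-level subtree without touching the first, because the first subtree's Count of 3 is smaller than the page number requested.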

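Once you have the page object, you can use that (and potentially its parents) to find which resources you need for that page. Look again in the PDF specification for the "Resources" dictionary and mind the discussion about inheritance (section 7.7.3.4): a page that lacks a Resources entry of its own takes the one from its nearest ancestor in the page tree that has it.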
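The inheritance lookup might look roughly like the sketch below. It assumes each parsed node is a dictionary that holds a direct reference to its parent under a Parent key; the function name inherited and the sample values are invented for the illustration. Resources, MediaBox, CropBox and Rotate are the page attributes the specification marks as inheritable.

```python
# The page attributes that ISO 32000-1 marks as inheritable.
INHERITABLE = ("Resources", "MediaBox", "CropBox", "Rotate")

def inherited(node, key):
    """Return the nearest value of key: the page's own, else an ancestor's.

    The value is taken whole from the first node that has it. Dictionaries
    found higher up the tree are not merged into it.
    """
    if key not in INHERITABLE:
        raise KeyError(key + " is not an inheritable page attribute")
    while node is not None:
        if key in node:
            return node[key]
        node = node.get("Parent")  # the root has no Parent, which ends the loop
    return None

# Hypothetical three-level chain: root -> intermediate -> leaf page.
root = {"Type": "Pages",
        "Resources": {"Font": {"F1": "Helvetica"}},
        "MediaBox": [0, 0, 612, 792]}
middle = {"Type": "Pages", "Parent": root, "MediaBox": [0, 0, 595, 842]}
leaf = {"Type": "Page", "Parent": middle}
```

With this data, the leaf gets its MediaBox from the intermediate node and its Resources from the root. This matters when you extract a single page: the inherited values have to be copied onto the page (or its new parent), otherwise they are lost along with the old tree.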