What are known limitations of borb related to PDF versions?


I'm new to borb, which seems to me a very promising Python package.

While trying to load a small sample of PDF documents to get some hands-on experience, I found that borb opens some of them without problems; for others I get messages such as "Unable to process XMP meta-data", and for yet others I get assertion errors.

So, before posting specific issues, I'm looking for information about borb's current limitations with respect to PDF versions, and about tools I could use beforehand to detect files that should be considered invalid PDF documents. Thanks.

I'm using borb release v2.0.20, just cloned from GitHub, and Python 3.6.5 on Windows 10.


There is 1 answer

Joris Schellekens On

Disclaimer: I am Joris Schellekens, author of the aforementioned library borb.

The problem is that the PDF spec (ISO 32000) leaves some room for interpretation at various points. That means some PDF libraries will interpret the spec in a given way, and produce documents that other tools may not consider compliant.

borb tends to be very strict when it comes to PDF parsing. As soon as an error is detected, it will throw the stack trace right back at you, whereas other PDF software (e.g. Adobe Reader) tends to be much more forgiving about what it accepts as an input PDF document.
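Since a strict parser raises as soon as it hits a problem, one way to survey a folder of samples is to catch the exception per file and record it. The following is only a sketch: `triage` and its `loader` parameter are hypothetical names, and the loader is passed in rather than hard-coded so that nothing here depends on a particular borb release.

```python
from pathlib import Path
from typing import BinaryIO, Callable, Dict, Iterable


def triage(paths: Iterable[Path], loader: Callable[[BinaryIO], object]) -> Dict[str, str]:
    """Try to load each file; map its path to "ok" or a short error description."""
    report = {}
    for path in paths:
        try:
            with open(path, "rb") as handle:
                loader(handle)
            report[str(path)] = "ok"
        except Exception as exc:
            # AssertionError is a subclass of Exception, so assertion
            # failures raised by a strict parser are recorded here too.
            report[str(path)] = "{}: {}".format(type(exc).__name__, exc)
    return report


# Assumed usage with borb (check the import path against your installed release):
#   from borb.pdf import PDF
#   for name, status in triage(Path("samples").glob("*.pdf"), PDF.loads).items():
#       print(name, "->", status)
```

The resulting report makes it easier to group files by failure type before logging a ticket for each distinct error.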

Although I certainly understand your frustration at being unable to process what you perceive to be "perfectly good PDF documents", I assure you that processing them might lead to even more issues.

I know for instance that there are cases where Adobe Reader tries to correct a bad PDF document, and as a result ends up corrupting the signatures in the document (very undesirable).

If you experience issues, and you can share the PDF, feel free to log a ticket on the GitHub repository.

Off the top of my head, the current limitations of borb are:

  • signatures
  • encrypted PDF documents
  • XREF not found
  • some images with transparent pixels
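As a rough first filter for some of the items above, a file can be inspected at the byte level before it is handed to any parser. The sketch below is a heuristic, not a validator and not part of borb: `precheck_pdf` is a hypothetical helper, and the 1024- and 2048-byte windows are assumptions chosen for illustration. It only looks for the `%PDF-` header, the `%%EOF` marker, the `startxref` keyword, and an `/Encrypt` entry.

```python
from typing import List


def precheck_pdf(data: bytes) -> List[str]:
    """Return heuristic warnings for a PDF given as raw bytes (empty list = nothing flagged)."""
    warnings = []
    head = data[:1024]   # assumed window for the header
    tail = data[-2048:]  # assumed window for the trailer keywords

    if b"%PDF-" not in head:
        warnings.append("no %PDF- header near the start of the file")
    if b"%%EOF" not in tail:
        warnings.append("no %%EOF marker near the end of the file")
    if b"startxref" not in tail:
        warnings.append("no startxref keyword near the end (XREF may not be found)")
    if b"/Encrypt" in data:
        warnings.append("contains /Encrypt (document is probably encrypted)")
    return warnings


# Assumed usage:
#   with open("sample.pdf", "rb") as fh:
#       for warning in precheck_pdf(fh.read()):
#           print(warning)
```

A file that passes this check can still be rejected by a strict parser; the check only sorts out the obvious cases cheaply.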