Edit a few lines of uncompressed PDF in Python

I want to edit a few lines in an uncompressed PDF. I found a similar problem, but since I need to scan the file several times to find the exact line positions I want to change, that approach doesn't really suit (and the sheer number of regex matches is more than I'd like). The PDF contains UTF-8-decodable lines (a few of which I want to edit, bookmark target IDs in particular) and a lot of binary blobs (images and so on, I guess). When I edit the file in Notepad it works fine, but when I do it programmatically (reading it in, changing a few lines, writing it back), images and some formatting are missing, since those bytes are never read in the first place because of the errors='ignore' option:

with codecs.open("merged-uncompressed.pdf", "r", encoding='ascii', errors='ignore') as f:
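
A minimal sketch of why the images go missing with errors='ignore': every byte outside the ASCII range is silently dropped while decoding, so it can never be written back. The sample bytes below are made up for illustration and are not taken from a real PDF.

```python
# Made-up sample: one text line followed by a line containing binary data.
original = b"/Title (Chapter 1)\n\x89\xff\xd8 binary blob\n"

# Decoding with errors="ignore" discards every byte that is not ASCII.
text = original.decode("ascii", errors="ignore")
round_tripped = text.encode("ascii")

# The difference is the number of bytes that were silently dropped.
lost_bytes = len(original) - len(round_tripped)
print(lost_bytes)
```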

I can also read the file with errors="surrogateescape", and I thought about mapping those lines onto the ones from the read above, but I don't know whether this approach can work.
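
For what it's worth, the surrogateescape idea can work on its own: Python maps each undecodable byte to a lone surrogate on read and turns it back into the same byte on write, as long as the same error handler is used in both directions. The following is only a sketch under assumptions: the helper name `edit_text_lines` and the `edit` callback are hypothetical, and `newline=""` is used so that line endings are not translated.

```python
from typing import Callable


def edit_text_lines(src: str, dst: str, edit: Callable[[str], str]) -> None:
    """Apply `edit` to every line of `src` and write the result to `dst`.

    Hypothetical helper: undecodable bytes survive as lone surrogates
    (errors="surrogateescape") and become the same bytes again on write.
    """
    # newline="" keeps \r, \n and \r\n exactly as they appear in the file.
    with open(src, "r", encoding="utf-8", errors="surrogateescape", newline="") as f:
        lines = f.readlines()

    # Binary-looking lines pass through `edit` too, so it should only
    # touch lines it recognizes.
    lines = [edit(line) for line in lines]

    with open(dst, "w", encoding="utf-8", errors="surrogateescape", newline="") as f:
        f.writelines(lines)
```

With an `edit` function that returns every line unchanged, the output file should be byte-for-byte identical to the input, which is an easy sanity check before making real edits.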

Does anyone know how to deal with this?

Best, Lukas

There is 1 answer

strikeback_93

I was able to solve this:

  1. Read the file as binary.
  2. Marked the lines that could not be decoded as UTF-8.
  3. Copied the lines one by one into a temporary list (undecodable lines were copied as the placeholder 'None\n').
  4. Ran the search on the copied list to find the lines I wanted to replace.
  5. Replaced those lines in the original binary list (same indices!).
  6. Wrote the list back to a file.
  7. The resulting PDF was slightly corrupted because of whitespace before the bookmark target IDs, but recompressing it with qpdf fixed that :)
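
Since the original code was not published, here is a rough sketch of how steps 1 to 6 might look. It is not the author's implementation: the function name `replace_text_lines` and the `rewrite` callback are hypothetical, and it uses a real `None` as the placeholder for undecodable lines rather than the string 'None\n'.

```python
from typing import Callable, List, Optional


def replace_text_lines(
    src: str,
    dst: str,
    rewrite: Callable[[str], Optional[str]],
) -> int:
    """Rewrite decodable lines of `src` into `dst`; return how many changed.

    `rewrite` (hypothetical callback) receives a decoded line and returns
    the replacement text, or None to leave the line untouched.
    """
    # Step 1: read as binary, keeping the original line endings.
    with open(src, "rb") as f:
        raw_lines: List[bytes] = f.read().splitlines(keepends=True)

    # Steps 2-3: mirror list with None for lines that are not valid UTF-8.
    mirror: List[Optional[str]] = []
    for raw in raw_lines:
        try:
            mirror.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            mirror.append(None)

    # Steps 4-5: search the mirror, patch the binary list at the same index.
    changed = 0
    for i, text in enumerate(mirror):
        if text is None:
            continue
        new_text = rewrite(text)
        if new_text is not None and new_text != text:
            raw_lines[i] = new_text.encode("utf-8")
            changed += 1

    # Step 6: write the patched bytes back out.
    with open(dst, "wb") as f:
        f.writelines(raw_lines)
    return changed
```

Because the mirror list and the binary list share indices, the mirror can be scanned as many times as needed before anything is patched. Edits that change a line's length also shift the byte offsets stored in the PDF's cross-reference table, so a repair pass with a tool such as qpdf (as in step 7) may still be needed.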

The code is very messy at the moment, so I don't want to publish it right now, but I plan to put it on GitHub within the next few weeks. If anyone needs it, just comment and I'll give it higher priority.

Thanks to everyone who wanted to help :) Lukas