Transform text contents of a PDF


I have a PDF with multiple misaligned text blocks. I am trying to generate a new PDF in which the text is aligned according to a known transformation matrix. I can use PyMuPDF (fitz) to extract the text from the source PDF and insert it into the target PDF, but this way I lose all the structural information (blocks, lines, spans, etc.):

import fitz  # PyMuPDF

src_doc = fitz.open('my.pdf')
tgt_doc = fitz.open()  # new, empty PDF

src_page = src_doc[0]
# An empty document has no pages, so create one with the source page's size
tgt_page = tgt_doc.new_page(width=src_page.rect.width, height=src_page.rect.height)

text_dict = src_page.get_text('dict')
transform = fitz.Matrix(1, 1)  # identity here; would be non-identity in practice
tw = fitz.TextWriter(tgt_page.rect)

blocks = []  # collected text blocks
for block in text_dict['blocks']:
    if block['type'] != 1:  # ignore images
        blocks.append(block)
        for line in block['lines']:
            for span in line['spans']:
                tw.append(span['origin'], span['text'])

# morph is a (fixpoint, matrix) pair
tw.write_text(tgt_page, morph=(fitz.Point(0, 0), transform))

tgt_doc.save('aligned.pdf')
src_doc.close()
tgt_doc.close()

This aligns the text, but it loses all information about the text structure: tgt_page ends up with more blocks than src_page.
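
To make the goal concrete, here is a minimal pure-Python sketch of the mapping I am after, applied to the extracted dictionary rather than to the PDF itself. It assumes the usual PDF matrix convention (a, b, c, d, e, f), where a point (x, y) maps to (a*x + c*y + e, b*x + d*y + f). The helper names `apply_matrix` and `transform_text_dict` and the sample data are hypothetical, made up for illustration. The sample only mimics the blocks/lines/spans nesting of `get_text('dict')` with the keys used above.

```python
from copy import deepcopy


def apply_matrix(point, m):
    """Map (x, y) through the affine matrix m = (a, b, c, d, e, f)."""
    x, y = point
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def transform_text_dict(text_dict, m):
    """Return a copy of the extraction with every span origin mapped through m.

    The blocks -> lines -> spans nesting is left untouched; only the
    'origin' of each span changes.
    """
    out = deepcopy(text_dict)
    for block in out["blocks"]:
        if block["type"] != 0:  # 0 = text block; skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                span["origin"] = apply_matrix(span["origin"], m)
    return out


# Hypothetical sample: one text block with two spans, plus one image block
sample = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"origin": (10.0, 20.0), "text": "Hello"},
                        {"origin": (50.0, 20.0), "text": "world"},
                    ]
                }
            ],
        },
        {"type": 1},
    ]
}

shift = (1, 0, 0, 1, 5, -3)  # pure translation by (+5, -3)
moved = transform_text_dict(sample, shift)
print(moved["blocks"][0]["lines"][0]["spans"][0]["origin"])
```

In this sketch the number of blocks, lines, and spans is the same before and after the transformation. That is the property I would like the target page to have once the text is written back.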

Can I do the same without compromising the page structure?

I was originally using pikepdf, as ocrmypdf does, but unfortunately pikepdf only supports ASCII characters, so I am having trouble using it for non-Latin text. Any other library that does the job is also fine.
