I have a PDF with multiple text blocks that are misaligned. I am trying to generate a new PDF in which the text is aligned according to a known transformation matrix. I can use PyMuPDF
(fitz) to extract the text information from the source PDF and insert it into the target PDF, but this way I lose all the structural information (blocks, lines, spans, etc.):
    import fitz

    src_doc = fitz.open('my.pdf')
    tgt_doc = fitz.open()  # new, empty PDF
    src_page = src_doc[0]
    # An empty document has no pages yet, so create one matching the source page size
    tgt_page = tgt_doc.new_page(width=src_page.rect.width, height=src_page.rect.height)

    text_dict = src_page.get_text('dict')
    transform = fitz.Matrix(1, 1)  # identity here; would be non-identity in practice

    tw = fitz.TextWriter(tgt_page.rect)
    for block in text_dict['blocks']:
        if block['type'] != 1:  # ignore images
            for line in block['lines']:
                for span in line['spans']:
                    tw.append(span['origin'], span['text'])

    # Apply the transformation around the page origin while writing
    tw.write_text(tgt_page, morph=(fitz.Point(0, 0), transform))

    tgt_doc.save('aligned.pdf')
    src_doc.close()
    tgt_doc.close()
This does the job of aligning the text, but it loses all information about the text structure: tgt_page will have more blocks than src_page.
Can I do the same without compromising the page structure?
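For reference, the structure in question is the nested blocks / lines / spans dictionary returned by get_text('dict'). Below is a minimal sketch of the coordinate bookkeeping only, assuming the matrix is given as six numbers (a, b, c, d, e, f) in the usual PDF convention. The helper names apply_matrix and transform_origins and the sample data are hypothetical, and the sketch does not by itself guarantee that a PDF written from the result is regrouped into the same blocks when it is extracted again.

```python
import copy


def apply_matrix(point, m):
    """Map (x, y) through a PDF-style matrix given as (a, b, c, d, e, f)."""
    x, y = point
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def transform_origins(text_dict, m):
    """Return a deep copy of text_dict with every span origin transformed.

    The nesting of blocks, lines and spans is left untouched.
    """
    out = copy.deepcopy(text_dict)
    for block in out["blocks"]:
        if block["type"] != 0:  # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                span["origin"] = apply_matrix(span["origin"], m)
    return out


# Hypothetical data shaped like a (heavily trimmed) get_text('dict') result
sample = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"origin": (10.0, 20.0), "text": "hello"},
                        {"origin": (50.0, 20.0), "text": "world"},
                    ]
                }
            ],
        },
        {"type": 1},  # image block, skipped
    ]
}

shift = (1, 0, 0, 1, 5, -3)  # pure translation by (+5, -3)
moved = transform_origins(sample, shift)
```

Because the copy keeps the original nesting, the transformed dictionary can be walked block by block, for example with one TextWriter per block instead of one per page, which may or may not be enough to keep the blocks separate in the output.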
I was originally using pikepdf, as used in ocrmypdf, but unfortunately pikepdf only supports ASCII characters, and I am having trouble using it for non-Latin text. Any other library that does the job is also okay.