Extracting Docx file from PDF in Python using Tesseract and Python-docx

Asked by Musaib Ahmed Razzaqui on 12 October 2023 at 22:18 (134 views)

Hey, is anyone experienced with converting Tesseract results to a docx file while preserving the formatting? I'm using pytesseract to produce hOCR output, but I'm unable to parse it into a docx file. I also had pytesseract output a PDF directly and the results are accurate, but I want an editable docx file. Sorry if this sounds beginner level; I'm just starting to learn Python and want to automate a very hectic process for my company.

I tried pytesseract.image_to_pdf_or_hocr with the extension set to pdf and got great results, but when I convert that PDF to docx with the pdf2docx library the formatting is lost. I think there has to be a way using the hOCR format and python-docx, since hOCR provides bounding boxes (bboxes), but I'm unable to figure it out. Any help would be appreciated. Thanks!
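
For illustration, the hOCR idea described above could be sketched roughly as follows. This is not a tested solution, only a minimal outline under several assumptions: the hOCR follows Tesseract's usual layout (span elements with class ocr_line containing ocrx_word spans, each with a "bbox x0 y0 x1 y1" entry in its title attribute), bboxes are in pixels at a known scan resolution (the dpi parameter is a guess you would have to supply), and approximating layout by one indented paragraph per OCR line is acceptable. python-docx has a flowing layout model, so exact positioning is not reproduced. The names HocrLineParser, parse_hocr_lines and hocr_to_docx are hypothetical helpers, not part of either library.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLineParser(HTMLParser):
    """Collect the bbox and text of every ocr_line span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.lines = []    # dicts: {"bbox": (x0, y0, x1, y1), "text": str}
        self._stack = []   # class attribute of each currently open <span>
        self._words = []   # words collected for the current line
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class") or ""
        self._stack.append(cls)
        if cls == "ocr_line":
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None
            self._words = []

    def handle_data(self, data):
        # Only keep text that sits inside a word span.
        if self._stack and self._stack[-1] == "ocrx_word" and data.strip():
            self._words.append(data.strip())

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        if self._stack.pop() == "ocr_line" and self._words:
            self.lines.append({"bbox": self._bbox, "text": " ".join(self._words)})


def parse_hocr_lines(hocr):
    """Return the OCR lines found in hOCR given as bytes or str."""
    if isinstance(hocr, bytes):
        hocr = hocr.decode("utf-8")
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines


def hocr_to_docx(hocr, out_path, dpi=300):
    """Write one paragraph per OCR line, indented according to its bbox.

    dpi is an assumption about the scanned image; it converts pixels to inches.
    """
    # Imported here so the parser above works without python-docx installed.
    from docx import Document
    from docx.shared import Inches, Pt

    margin_in = 0.5  # assumed page margin, in inches
    doc = Document()
    doc.sections[0].left_margin = Inches(margin_in)
    for line in parse_hocr_lines(hocr):
        para = doc.add_paragraph(line["text"])
        para.paragraph_format.space_after = Pt(0)
        if line["bbox"] is not None:
            # Indent is measured from the left margin, so subtract it.
            indent_in = max(0.0, line["bbox"][0] / dpi - margin_in)
            para.paragraph_format.left_indent = Inches(indent_in)
    doc.save(out_path)


# Possible usage (assumes Pillow and pytesseract are installed and
# "page.png" is an image of one page):
#   from PIL import Image
#   import pytesseract
#   hocr = pytesseract.image_to_pdf_or_hocr(Image.open("page.png"), extension="hocr")
#   hocr_to_docx(hocr, "page.docx", dpi=300)
```

Only the left edge of each line is used here. Vertical gaps between bboxes, font sizes and multi-column layouts would need extra handling, for example by comparing the y coordinates of consecutive lines to decide where a paragraph break belongs.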

There are 0 answers
