Extracting a DOCX file from a PDF in Python using Tesseract and python-docx


Hey, is anyone experienced with converting Tesseract results to a .docx file while preserving the formatting? I'm using pytesseract to produce hOCR output, but I can't work out how to turn it into a .docx file. Converting to PDF directly with pytesseract gives accurate results, but I want an editable .docx file. Sorry if this sounds beginner-level; I'm just starting to learn Python and want to automate a very hectic process for my company.

I tried pytesseract.image_to_pdf_or_hocr with the extension set to pdf and got great results, but when I convert that PDF to .docx with the pdf2docx library, the formatting is lost. I think there must be a way to do this with the hOCR output and python-docx, since hOCR provides bounding boxes (bboxes), but I haven't been able to figure it out. Any help would be appreciated. Thanks!
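To make the hOCR idea concrete, here is a minimal sketch of one possible approach, not a tested solution for any particular document. It assumes the hOCR marks text lines as `ocr_line` elements and words as `ocrx_word` elements with a `bbox x0 y0 x1 y1` entry in their `title` attribute, which is how Tesseract normally writes it. The names `HocrLineParser`, `parse_hocr_lines` and `hocr_to_docx`, and the `dpi=300` default, are hypothetical choices made for illustration. Only the left edge of each line is carried over, as a paragraph indent; fonts, columns and tables are not reconstructed.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 30 40 300 60; baseline 0 -3"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLineParser(HTMLParser):
    """Collect the bounding box and words of every ocr_line element."""

    def __init__(self):
        super().__init__()
        self.lines = []        # each entry: {"bbox": (x0, y0, x1, y1), "words": [...]}
        self._in_word = False  # True while inside an ocrx_word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocr_line" in classes and match:
            bbox = tuple(int(v) for v in match.groups())
            self.lines.append({"bbox": bbox, "words": []})
        elif "ocrx_word" in classes:
            self._in_word = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_word = False

    def handle_data(self, data):
        # Keep text only when it belongs to a word inside a known line.
        if self._in_word and self.lines and data.strip():
            self.lines[-1]["words"].append(data.strip())


def parse_hocr_lines(hocr):
    """Return [(bbox, text), ...] for each non-empty line in an hOCR document."""
    if isinstance(hocr, bytes):  # pytesseract returns hOCR as bytes
        hocr = hocr.decode("utf-8")
    parser = HocrLineParser()
    parser.feed(hocr)
    return [(line["bbox"], " ".join(line["words"]))
            for line in parser.lines if line["words"]]


def hocr_to_docx(hocr, out_path, dpi=300):
    """Write one paragraph per OCR line, indented according to its left edge.

    Assumption: the scan resolution is `dpi` pixels per inch. The indent is
    applied relative to the page margin, so the placement is only approximate.
    """
    from docx import Document        # python-docx, imported lazily
    from docx.shared import Inches

    doc = Document()
    for (x0, _y0, _x1, _y1), text in parse_hocr_lines(hocr):
        para = doc.add_paragraph(text)
        para.paragraph_format.left_indent = Inches(x0 / dpi)
    doc.save(out_path)


# Possible usage (the file names are placeholders):
#   import pytesseract
#   from PIL import Image
#   hocr = pytesseract.image_to_pdf_or_hocr(Image.open("page.png"), extension="hocr")
#   hocr_to_docx(hocr, "page.docx")
```

Parsing is kept separate from writing so the extracted lines can be inspected before anything is committed to the document. The vertical coordinates are parsed but unused here; they could drive paragraph spacing or font size if closer fidelity is needed.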


There are 0 answers