How to convert a scanned PDF to a DOCX


Hey, I'm stuck on a problem. I want to convert a scanned PDF to a DOCX document while preserving the formatting. How do I use layout-parser so that the diagrams and tables in the scanned PDF are preserved?

I tried converting through pytesseract (image to hOCR), but it doesn't handle images. Also, the text output is very annoying to work with.

There are 2 answers

1
Oppa Oppa On

Create a free trial account for Adobe Acrobat and open your PDF in Adobe Acrobat. Go to “File,” pick “Save As Other,” then choose “Microsoft Word” and “Word Document.” Then choose a name and a location for your Word document.

0
K J On

Word can import scanned PDF pages. Your biggest problem will be which method was used for the OCR, because the result needs to be edited to match the image and therefore needs manual styling. For example, here I used red as my preference for a scan of this page.

You may need to look at commercial offerings from ABBYY, Acme, Adobe, Apryse, through to z-I-got OCR PDF, etc.
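
If you stay with the pytesseract hOCR route from the question, the manual editing starts from knowing where each recognised word sits on the page. Below is a minimal sketch, not a full converter, that reads hOCR text and returns every word with its bounding box using only the Python standard library. The names `parse_bbox`, `HocrWords` and `hocr_words` are made up for this example, and it assumes the usual hOCR layout where each word is a `span` with class `ocrx_word` and a `title` containing `bbox x0 y0 x1 y1`.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWords(HTMLParser):
    """Collect (text, bbox) for every ocrx_word span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.words = []     # list of (text, (x0, y0, x1, y1))
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []     # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def hocr_words(hocr):
    """Parse an hOCR string and return its words with bounding boxes."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words


# With pytesseract the input would come from something like:
#   hocr = pytesseract.image_to_pdf_or_hocr(image, extension="hocr").decode("utf-8")
sample = (
    '<span class="ocr_line" title="bbox 10 10 90 20">'
    '<span class="ocrx_word" title="bbox 10 10 40 20; x_wconf 96">Hello</span> '
    '<span class="ocrx_word" title="bbox 50 10 90 20; x_wconf 91">world</span>'
    "</span>"
)
print(hocr_words(sample))
# [('Hello', (10, 10, 40, 20)), ('world', (50, 10, 90, 20))]
```

The boxes only tell you where text was found. Diagrams and tables still have to be located some other way (for example with a layout model) and cropped from the page image before they can be placed in the DOCX, so treat this as a starting point for the manual styling rather than a finished pipeline.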
