I am trying to convert a PDF to DOCX using soffice. The conversion works, but the resulting .docx puts the text into text boxes, which I am unable to read using the docx API for Python. Is there a better way to read the file, or a better way to convert PDF to DOCX so that I do not get text boxes? This is the command I run:
soffice --infilter="writer_pdf_import" --convert-to docx "convert_this.pdf"
You can try using Aspose.Words for Cloud to convert PDF to Word documents: https://docs.aspose.cloud/display/wordscloud/Convert+PDF+Document+to+Word
It converts the PDF from fixed form to flow form, so the result is editable in MS Word.
Disclosure: I work on the Aspose.Words team.
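If the text boxes cannot be avoided, another option may be to read them straight from the converted file. A .docx is a zip archive, and text box content is stored under `w:txbxContent` elements in `word/document.xml`. The following is only a minimal sketch using the Python standard library instead of the docx API. The function name `textbox_paragraphs` is made up for illustration. It assumes the text boxes are in the main document part (not headers or footers), and that when a producer writes each box twice (a primary copy plus an `mc:Fallback` copy) only the primary one is wanted.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by WordprocessingML and its markup-compatibility wrapper.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"


def textbox_paragraphs(docx_path):
    """Return the text of each non-empty paragraph found inside text boxes.

    Hypothetical helper: reads word/document.xml only, so text boxes in
    headers, footers, or other parts are not covered.
    """
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    paragraphs = []

    def walk(element, inside_textbox):
        # Skip the fallback copy so a text box stored twice is read once.
        if element.tag == f"{{{MC}}}Fallback":
            return
        if element.tag == f"{{{W}}}txbxContent":
            inside_textbox = True
        if inside_textbox and element.tag == f"{{{W}}}p":
            # Join all text runs of this paragraph into one string.
            text = "".join(t.text or "" for t in element.iter(f"{{{W}}}t"))
            if text.strip():
                paragraphs.append(text)
            return
        for child in element:
            walk(child, inside_textbox)

    walk(root, False)
    return paragraphs
```

With the command from the question, the output file should be convert_this.docx, so calling `textbox_paragraphs("convert_this.docx")` would return the text box paragraphs in document order. Note that PDF import tends to produce one text box per line or fragment, so the result may still need to be merged into sentences.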