Python module to convert doc/pdf/docx/rtf formats to text


I have been searching Google for answers, but I could not find a single module that converts doc/pdf/docx/rtf to text.

Is there any python module to convert doc/pdf/docx/rtf formats to text?

Katherine Mejia-Guerra On

One module to rule them all!

textract. It supports many file types for text extraction, including all the ones that you specified in your question.

  • .doc via antiword
  • .pdf via pdftotext (default) or pdfminer.six
  • .docx via python-docx
  • .rtf via unrtf
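Three of the backends listed above (antiword, pdftotext, unrtf) are external command-line tools rather than Python packages, so extraction for those formats can fail if the tool is not installed. A minimal sketch for checking this up front, assuming the tools are expected on PATH under exactly those names; `BACKENDS` and `missing_backends` are hypothetical names used only for illustration:

```python
import shutil

# Command-line tools from the list above, keyed by file extension.
# python-docx (used for .docx) is a Python package, so it is not checked here.
BACKENDS = {".doc": "antiword", ".pdf": "pdftotext", ".rtf": "unrtf"}


def missing_backends():
    """Return {extension: tool} for every tool that is not found on PATH."""
    return {ext: tool for ext, tool in BACKENDS.items()
            if shutil.which(tool) is None}


if __name__ == "__main__":
    for ext, tool in missing_backends().items():
        print(f"{ext}: '{tool}' was not found on PATH")
```

If a tool is reported missing, install it with your system's package manager before calling textract on that file type.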

PDF example

http://textract.readthedocs.io/en/latest/python_package.html

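import textract
# method='pdfminer' selects the pdfminer backend instead of the default pdftotext
raw = textract.process('path/to/a.pdf', method='pdfminer')
text = raw.decode('utf-8')  # process() returns bytes, UTF-8 encoded by default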
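The same call works for the other formats in the question, since textract picks the parser from the file extension. A small wrapper along these lines could cover all four; this is only a sketch, `SUPPORTED`, `is_supported` and `extract_text` are hypothetical names, and the decode step assumes textract's default UTF-8 output:

```python
from pathlib import Path

# Only the extensions asked about in the question; textract handles more.
SUPPORTED = {".doc", ".pdf", ".docx", ".rtf"}


def is_supported(path):
    """True if the file extension is one of the four formats in the question."""
    return Path(path).suffix.lower() in SUPPORTED


def extract_text(path, **kwargs):
    """Extract text from a doc/pdf/docx/rtf file and return it as a str.

    Extra keyword arguments (for example method='pdfminer') are passed
    straight through to textract.process.
    """
    if not is_supported(path):
        raise ValueError(f"unsupported file type: {path}")
    import textract  # imported here so is_supported works without textract installed
    return textract.process(str(path), **kwargs).decode("utf-8")
```

Usage would look like `extract_text('path/to/a.docx')` or `extract_text('path/to/a.pdf', method='pdfminer')`.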