The PDF files contain real text, not scanned images. PDFMiner does not support Python 3; are there any other solutions?
PDF text extract with Python3.4
Asked by Tom Liu
There are 3 answers
tika worked best for me; in my experience it does better than PyPDF2 and pdfminer. It made it really easy to extract each line of the PDF into a list. You can install it with pip install tika and then use the code below:
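from tika import parser

# path_to_pdf is the path to your PDF file; tika needs a Java runtime installed
rawText = parser.from_file(path_to_pdf)
# 'content' holds the extracted text; it can be None if nothing was extracted
rawList = (rawText['content'] or '').splitlines()
print(rawList)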
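The extracted text can contain blank lines and stray whitespace, so the list may need a little cleanup. Here is a minimal sketch of one way to do that; clean_lines and pdf_to_lines are hypothetical helper names made up for illustration, not part of tika, and the code assumes only the parser.from_file call shown above.

```python
def clean_lines(text):
    """Split extracted text into stripped, non-empty lines."""
    # tika may return None for 'content' when no text was extracted
    if not text:
        return []
    return [line.strip() for line in text.splitlines() if line.strip()]


def pdf_to_lines(path_to_pdf):
    """Extract text from a PDF with tika and return its non-empty lines."""
    # Imported here so clean_lines can be used without tika installed
    from tika import parser

    parsed = parser.from_file(path_to_pdf)
    return clean_lines(parsed.get('content'))
```

Calling pdf_to_lines(path_to_pdf) should then give a list of the non-empty lines. Keeping the cleanup in its own function means it can be tried on any string without a PDF or a running Tika server.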
There is also the pdfminer2 fork, which supports Python 3.4 and is available through pip3: https://github.com/metachris/pdfminer
This thread helped me patch something together.