PDF text extract with Python3.4

Question

Asked by Tom Liu on 24 June 2015 at 10:11 (3k views)

The text in the PDF files is real text, not scanned images. PDFMiner does not support Python 3. Are there any other solutions?

There are 3 answers

**DmcG** · Answer 1 · 2016-02-05T13:30:20+00:00

There is also the pdfminer2 fork, which supports Python 3.4 and is available through pip3: https://github.com/metachris/pdfminer

This thread helped me patch something together.

from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO, BytesIO

def readPDF(pdfFile):
    # Resource manager caches shared resources; retstr collects the extracted text.
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    pages = PDFPage.get_pages(pdfFile, pagenos, maxpages=maxpages,
                              password=password, caching=caching,
                              check_extractable=True)
    for page in pages:
        # Each processed page is written as text into retstr by the converter.
        interpreter.process_page(page)

    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr

if __name__ == "__main__":
    #scrape = open("../warandpeace/chapter1.pdf", 'rb') # for local files
    scrape = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf") # for external files
    pdfFile = BytesIO(scrape.read())
    outputString = readPDF(pdfFile)
    print(outputString)
    pdfFile.close()
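
The script above switches between a local file and a URL by commenting lines in and out. As a minimal sketch (not part of the original answer), a small helper could choose based on the source string. The name `load_pdf_bytes` is hypothetical, and it assumes anything that is not an http(s) URL is a local path.

```python
from io import BytesIO
from urllib.parse import urlparse
from urllib.request import urlopen


def load_pdf_bytes(source):
    """Return a BytesIO holding the PDF found at a URL or a local path."""
    # Assumption: only http/https sources are remote; everything else is a file path.
    scheme = urlparse(source).scheme
    if scheme in ("http", "https"):
        with urlopen(source) as response:
            data = response.read()
    else:
        with open(source, "rb") as handle:
            data = handle.read()
    # A seekable in-memory buffer, like the pdfFile object built in the script above.
    return BytesIO(data)
```

The returned buffer could then be passed to readPDF in place of the hand-built pdfFile.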

**Shruti Agrawal** · Answer 2 · 2018-10-09T08:53:42+00:00

For Python 3, you can install pdfminer through the pdfminer.six package:

python -m pip install pdfminer.six
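
Once pdfminer.six is installed, a shorter route than wiring up the converter classes by hand may be its high-level `extract_text` function. The sketch below is not from the original answer. It assumes a reasonably recent pdfminer.six release that ships `pdfminer.high_level`, and the wrapper name `extract_pdf_text` is hypothetical.

```python
def extract_pdf_text(path):
    """Return the text of the PDF at `path` as one string, using pdfminer.six."""
    # Imported inside the function so the definition still loads where
    # pdfminer.six is not installed; the call itself requires the package.
    from pdfminer.high_level import extract_text

    return extract_text(path)
```

For example, `extract_pdf_text("chapter1.pdf")` (a hypothetical local file) should return the document's text as a single string.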

**Siddharth Das** · Answer 3 · 2019-06-20T08:07:36+00:00

tika worked best for me. I would say it is better than PyPDF2 and pdfminer: it made it really easy to extract each line of the PDF into a list. You can install it with:

pip install tika

Then use the code below:

from tika import parser
rawText = parser.from_file(path_to_pdf)
rawList = rawText['content'].splitlines()
print(rawList)
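
The snippet above assumes `rawText['content']` is always a string. As a hedged sketch, the helper below (hypothetical name `content_to_lines`) treats a missing or `None` content value as empty and drops blank lines. The `sample` dict is a stand-in for what `parser.from_file` returns, so the example runs without tika installed.

```python
def content_to_lines(parsed):
    """Turn a tika-style result dict into a list of non-empty, stripped lines."""
    # Assumption: 'content' may be absent or None when no text was extracted.
    content = parsed.get("content") or ""
    return [line.strip() for line in content.splitlines() if line.strip()]


# Stand-in for the dict returned by parser.from_file(path_to_pdf).
sample = {"content": "\n\nChapter 1\n  Well, Prince...  \n\n", "metadata": {}}
print(content_to_lines(sample))  # ['Chapter 1', 'Well, Prince...']
```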
