In pypdf, I can get the total number of pages of a PDF file via:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
no_of_pages = len(reader.pages)
How can I get this using PDFMiner?
I realize you were asking for PDFMiner. However, people coming via Google Search to this question might also be interested in alternatives to PDFMiner.
PyPDF2 (since merged back into pypdf, which exposes the same PdfReader API) is a pure-Python alternative that recently improved a lot (e.g. text extraction and decryption):
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
pdf_page_count = len(reader.pages)
pikepdf is another option:
from pikepdf import Pdf
pdf_doc = Pdf.open('fourpages.pdf')
pdf_page_count = len(pdf_doc.pages)
Using pdfminer.six, you just need to import the high-level function extract_pages, convert the generator into a list, and take its length.
from pdfminer.high_level import extract_pages
pdf_file = "example.pdf"  # path to the PDF
print(len(list(extract_pages(pdf_file))))
Using pdfminer, import the necessary modules.
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
Create a PDF parser object associated with the file object.
fp = open('your_file.pdf', 'rb')
parser = PDFParser(fp)
Create a PDF document object that stores the document structure.
document = PDFDocument(parser)
Iterate through the create_pages() generator, incrementing a counter for each page.
num_pages = 0
for page in PDFPage.create_pages(document):
    num_pages += 1
print(num_pages)
I hate to just leave a code snippet. For context, the current pdfminer.six repo is where you might be able to learn a little more about the resolve1 function.
As you're working with PDFMiner, you might print values and come across some PDFObjRef objects. Essentially, you can use resolve1 to expand those objects (they're usually a dictionary).