Python PDFMiner: How to link outlines to underlying text

I am trying to parse a PDF and build a hierarchical structure from it. Consider the following input:

Title 1
some text some text some text some text some text some text some text 
some text some text some text some text some text some text some text 

Title 1.1
some more text some more text some more text some more text 
some more text some more text some more text some more text 
some more text some more text 

Title 2
some final text some final text 
some final text some final text some final text some final text 
some final text some final text some final text some final text 

Here is how I extract the outline titles:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
# Each outline entry is a (level, title, dest, action, structure element) tuple.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print (level, title)

This gives me:

(1, u'Title 1')
(2, u'Title 1.1')
(1, u'Title 2')

This is perfect, as the levels match the text hierarchy. Now I can extract the text as follows:

from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
text_from_pdf = open('textFromPdf.txt','w')
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    layout = device.get_result()
    for element in layout:
        if isinstance(element, LTTextBox):
            # Replace non-ASCII characters with spaces before writing.
            text_from_pdf.write(''.join([i if ord(i) < 128 else ' ' for i in element.get_text()]))
text_from_pdf.close()

This gives me:

Title 1
some text some text some text some text some text some text some text 
some text some text some text some text some text some text some text 
Title 1.1
some more text some more text some more text some more text 
some more text some more text some more text some more text 
some more text some more text 
Title 2
some final text some final text 
some final text some final text some final text some final text 
some final text some final text some final text some final text 

The order is fine, but I have now lost all sense of hierarchy. How do I know where one title's text ends and the next begins? And which heading, if any, is the parent of a given title?
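The parent question can probably be answered from the outline levels alone. Here is a minimal sketch, assuming the outline has already been reduced to (level, title) pairs like the ones printed above. The function name `outline_parents` is hypothetical and not part of pdfminer.

```python
def outline_parents(outlines):
    """Return (level, title, parent_title) for each (level, title) pair.

    Hypothetical helper, not part of pdfminer. The parent is the nearest
    preceding entry with a smaller level, or None for top-level entries.
    """
    stack = []    # (level, title) of the currently open ancestors
    result = []
    for level, title in outlines:
        # Drop every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1] if stack else None
        result.append((level, title, parent))
        stack.append((level, title))
    return result


outlines = [(1, "Title 1"), (2, "Title 1.1"), (1, "Title 2")]
for level, title, parent in outline_parents(outlines):
    print(level, title, "parent:", parent)
```

The stack holds only the chain of open ancestors, so one pass over the outline is enough, and skipped levels (say, level 1 followed directly by level 3) still resolve to the nearest shallower heading.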

Is there a way to connect the outline information to the layout elements? It would be great to be able to parse all the information while iterating through the levels.
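One rough way to connect the two, sketched here under the assumption that each outline title also appears as a line of its own in the extracted text, is to split the text at lines that match the next expected title. The helper `split_by_titles` is hypothetical. It will break if the title text in the outline differs from the text on the page (extra whitespace inside the title, ligatures, numbering), or if a body line happens to equal the next title.

```python
def split_by_titles(lines, titles):
    """Group text lines under the outline title that precedes them.

    Hypothetical helper. `titles` must be in document order. Lines that
    appear before the first title are dropped.
    """
    remaining = list(titles)
    sections = []                     # list of (title, [body lines])
    for line in lines:
        text = line.strip()
        if not text:
            continue                  # skip blank lines
        if remaining and text == remaining[0]:
            # The next expected title starts a new section.
            sections.append((remaining.pop(0), []))
        elif sections:
            sections[-1][1].append(text)
    return sections


lines = [
    "Title 1", "some text some text",
    "Title 1.1", "some more text",
    "Title 2", "some final text",
]
titles = ["Title 1", "Title 1.1", "Title 2"]
for title, body in split_by_titles(lines, titles):
    print(title, "->", body)
```

Matching only against the next expected title, rather than against any title, keeps a repeated heading string later in the body from starting a section out of order.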

Another problem is that if there are any citations at the bottom of a page, then the citation text gets mixed in with the results. Is there a way to ignore the headers, footers and citations when parsing a PDF?
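A common heuristic, shown here only as a sketch, is to drop text boxes that fall inside a top or bottom margin band of the page. The function `in_body` and the 10% margin are assumptions chosen for illustration. Footnotes that extend further up the page would need a larger margin or a different test. PDF coordinates have their origin at the bottom-left corner, so a larger y value is higher on the page.

```python
def in_body(y0, y1, page_height, margin=0.1):
    """Return True if a box lies entirely outside the top and bottom margins.

    Hypothetical helper. y0 is the bottom edge and y1 the top edge of the
    box, in PDF coordinates (origin at the bottom-left of the page).
    `margin` is the fraction of the page height treated as header/footer.
    """
    low = page_height * margin
    high = page_height * (1 - margin)
    return y0 >= low and y1 <= high


page_height = 792   # US Letter height in points, used as sample data
boxes = [
    (750, 780, "Running header"),
    (300, 700, "Body paragraph"),
    (20, 60, "1. Some citation"),
]
kept = [text for y0, y1, text in boxes if in_body(y0, y1, page_height)]
print(kept)
```

In the extraction loop above, the same test could be applied to each `LTTextBox` using its `y0` and `y1` attributes together with the `height` of the page layout object, skipping elements for which it returns False.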

There is 1 answer

Suriya Kumar J S

I hope it is possible, but the pdfminer documentation clearly states the following limitation:

Some PDF documents use page numbers as destinations, while others use page numbers and the physical location within the page. Since PDF does not have a logical structure, and it does not provide a way to refer to any in-page object from the outside, there’s no way to tell exactly which part of text these destinations are referring to.

https://pdfminer-docs.readthedocs.io/programming.html#:~:text=Some%20PDF%20documents,are%20referring%20to.
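For documents whose destinations do carry a position within the page, an approximate mapping may still be workable. The sketch below assumes each destination has already been resolved to a hypothetical (page_index, top, title) tuple and each text box to (page_index, top, text). How to resolve destinations with pdfminer is not shown here. Destinations that carry only a page number have no usable `top`, which is the limitation the documentation describes.

```python
def assign_to_sections(destinations, boxes):
    """Attach each text box to the last destination at or before it.

    Hypothetical helper operating on plain tuples:
      destinations: (page_index, top, title)
      boxes:        (page_index, top, text)
    Reading order is by page, then from the top of the page downward.
    """
    def key(page_index, y):
        # Larger y is higher on the page, so negate it to sort top-down.
        return (page_index, -y)

    ordered = sorted(destinations, key=lambda d: key(d[0], d[1]))
    sections = {title: [] for _, _, title in ordered}
    for page_index, top, text in sorted(boxes, key=lambda b: key(b[0], b[1])):
        current = None
        for d_page, d_top, title in ordered:
            if key(d_page, d_top) <= key(page_index, top):
                current = title   # keep the latest destination not after the box
        if current is not None:
            sections[current].append(text)
    return sections


destinations = [(0, 700, "Title 1"), (0, 500, "Title 1.1"), (1, 700, "Title 2")]
boxes = [
    (0, 700, "Title 1"), (0, 680, "some text"),
    (0, 500, "Title 1.1"), (0, 480, "some more text"),
    (1, 700, "Title 2"), (1, 680, "some final text"),
]
sections = assign_to_sections(destinations, boxes)
for title, texts in sections.items():
    print(title, "->", texts)
```

The result is only as accurate as the destination coordinates, so treat it as a best-effort grouping rather than an exact link between outline entries and text.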

Thanks