How to ignore table and its content while extracting text from pdf

Question


2.9k views Asked by go sgenq At 04 May 2021 at 07:29

So far I have successfully extracted the text content from a PDF file. I am now stuck at the point where I need to extract only the text outside of the tables (ignoring each table and its content), and I need help.

The PDF can be downloaded from here.

import pdfplumber

# Open the PDF with a context manager so the file is closed afterwards.
with pdfplumber.open(r'\List of Reportable Jurisdictions for 2020 CRS information reporting_9 Feb.pdf') as pdf:
    # Iterate over the pages directly instead of indexing by position.
    for page in pdf.pages:
        text = page.extract_text(x_tolerance=3, y_tolerance=3)
        print(text)


There is 1 answer

**Samkit Jain** · Answer 1 · 2021-05-13T13:52:56+00:00

For the PDF you have shared, you can use the following code to extract the text outside the tables:

import pdfplumber


def not_within_bboxes(obj):
    """Return True if the object lies outside every table's bbox."""

    def obj_in_bbox(_bbox):
        """See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404"""
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)

    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)


with pdfplumber.open("file.pdf") as pdf:
    for page in pdf.pages:
        print("\n\n\n\n\nAll text:")
        print(page.extract_text())

        # Get the bounding boxes of the tables on the page.
        bboxes = [
            table.bbox
            for table in page.find_tables(
                table_settings={
                    "vertical_strategy": "explicit",
                    "horizontal_strategy": "explicit",
                    "explicit_vertical_lines": page.curves + page.edges,
                    "explicit_horizontal_lines": page.curves + page.edges,
                }
            )
        ]

        print("\n\n\n\n\nText outside the tables:")
        print(page.filter(not_within_bboxes).extract_text())

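I am using the .filter() method provided by pdfplumber to drop any object whose midpoint falls inside the bounding box of any of the tables (this is the check in not_within_bboxes(...)). The result is a filtered version of the page that contains only the objects outside the tables, so calling extract_text() on it returns only the non-table text. Note that not_within_bboxes reads the bboxes list that the loop rebuilds for each page, so the filter must be applied after bboxes has been set for that page.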
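The midpoint test can be tried without a PDF. The following is a minimal sketch, not part of the original answer: make_outside_filter is a hypothetical helper name, and the bbox and character dictionaries are made-up values that only mimic the keys pdfplumber objects carry (x0, x1, top, bottom). It passes the bboxes in explicitly rather than relying on a module-level variable.

```python
def make_outside_filter(bboxes):
    """Build a predicate that is True for objects outside every bbox.

    Hypothetical helper for illustration; each bbox is an
    (x0, top, x1, bottom) tuple, as in the answer above.
    """

    def midpoint_in_bbox(obj, bbox):
        # An object counts as inside when its midpoint is inside the bbox.
        x0, top, x1, bottom = bbox
        h_mid = (obj["x0"] + obj["x1"]) / 2
        v_mid = (obj["top"] + obj["bottom"]) / 2
        return x0 <= h_mid < x1 and top <= v_mid < bottom

    def outside_all(obj):
        return not any(midpoint_in_bbox(obj, bbox) for bbox in bboxes)

    return outside_all


# Made-up data: one table bbox and two character-like objects.
table_bboxes = [(100, 200, 400, 500)]
char_in_table = {"x0": 150, "x1": 160, "top": 250, "bottom": 262}
char_outside = {"x0": 50, "x1": 60, "top": 100, "bottom": 112}

keep = make_outside_filter(table_bboxes)
print(keep(char_in_table))  # False: midpoint (155, 256) is inside the bbox
print(keep(char_outside))   # True: midpoint (55, 106) is outside the bbox
```

With a real page, the same predicate could presumably be passed to the filter call used above, as in page.filter(make_outside_filter(bboxes)).extract_text(), which avoids depending on a global bboxes.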
