So far I have succeeded in extracting the text content from a PDF file. I am stuck at the point where I have to extract only the text outside the tables (ignoring each table and its content), and I need help.
The PDF can be downloaded from here
import pdfplumber
pdfinstance = pdfplumber.open(r'\List of Reportable Jurisdictions for 2020 CRS information reporting_9 Feb.pdf')
for epage in range(len(pdfinstance.pages)):
    page = pdfinstance.pages[epage]
    text = page.extract_text(x_tolerance=3, y_tolerance=3)
    print(text)
For the PDF you have shared, you can use the following code to extract the text outside the tables.
I am using the .filter() method provided by pdfplumber to drop any objects that fall inside the bounding box of any of the tables (in not_within_bboxes(...)), creating a filtered version of the page that contains only those objects that fall outside every table.
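The filtering described above can be sketched as follows. This is a minimal sketch rather than a definitive implementation: it assumes that pdfplumber's default table detection (find_tables() with no custom settings) locates the tables in this PDF, and it treats an object as being inside a table when its center point lies within the table's bounding box. The helper names make_not_within_bboxes and extract_text_outside_tables, and the center-point test itself, are assumptions chosen for illustration.

```python
def center_in_bbox(obj, bbox):
    """Return True if the center of a pdfplumber object lies inside bbox.

    bbox uses pdfplumber's (x0, top, x1, bottom) layout.
    """
    x0, top, x1, bottom = bbox
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def make_not_within_bboxes(bboxes):
    """Build a predicate for Page.filter() that keeps objects outside all bboxes."""
    def not_within_bboxes(obj):
        return not any(center_in_bbox(obj, bbox) for bbox in bboxes)
    return not_within_bboxes


def extract_text_outside_tables(pdf_path):
    """Return one string per page, with the contents of detected tables removed."""
    # Imported here so the helpers above can be used without pdfplumber installed.
    import pdfplumber

    texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Bounding boxes of every table detected on this page.
            bboxes = [table.bbox for table in page.find_tables()]
            # Filtered copy of the page that holds only non-table objects.
            filtered = page.filter(make_not_within_bboxes(bboxes))
            # Older versions may return None for an empty page, hence the fallback.
            texts.append(filtered.extract_text(x_tolerance=3, y_tolerance=3) or "")
    return texts
```

Calling extract_text_outside_tables with the path of the PDF from the question returns a list of strings, one per page. If the default detection misses a table or splits it incorrectly, a table_settings dictionary can be passed to find_tables() to tune it.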