I am trying to build a simple PDF scraper using pandas and pdfquery. I want to take the data I need from each page of the PDF by using the XML coordinates, put it into a DataFrame, and then save the DataFrame as a CSV file. I am having problems with the last part: I can get the data from a single PDF page, but I can't get it to work over multiple pages. I'm a relative beginner to Python, so any help is appreciated.
import pdfquery
import pandas as pd

pdf = pdfquery.PDFQuery(r'path')
pdf.load()
pdf.tree.write('pdfXML.txt', pretty_print = True)

def pdfscrape(pdf):
    num_1 = pdf.pq('LTTextBoxHorizontal:overlaps_bbox("378.0, 759.06, 456.0, 769.06")').text()
    num_2 = pdf.pq('LTTextBoxHorizontal:overlaps_bbox("30.0, 431.06, 360.0, 441.06")').text()
    page = pd.DataFrame({ 'num1': num_1,'num2': num_2, },index=[0])
    print(page)

pagecount = pdf.doc.catalog['Pages'].resolve()['Count']
master = pd.DataFrame()
for p in range(pagecount):
    pdf.load(p)
    page = pdfscrape(pdf)
    master = master(pd.concat([page], ignore_index=True))
master.to_csv("output.csv", index=False)
The result I expected was a CSV file with the data points I needed from each page of the PDF. Instead I got:
Traceback (most recent call last):
master = master(pd.concat([page], ignore_index=True))
line 380, in concat
op = _Concatenator(
line 443, in __init__
objs, keys = self._clean_keys_and_objs(objs, keys)
line 539, in _clean_keys_and_objs
raise ValueError("All objects passed were None")
ValueError: All objects passed were None
What you can do is load only the page you are interested in:
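Here is a minimal sketch of that idea, reusing the selectors from the question. The names `scrape_page` and `scrape_all_pages` are hypothetical helpers, and the sketch assumes `pdf.load(p)` loads only the zero-based page `p`, as the question's loop already does. Two changes from the question's code matter. First, the function returns the DataFrame instead of only printing it: a function without `return` yields `None`, which is what `pd.concat` complained about. Second, the per-page frames are collected in a list and concatenated once, rather than calling `master(...)` as if the DataFrame were a function.

```python
import pandas as pd

# Bounding boxes copied from the question (x0, y0, x1, y1).
NUM1_BBOX = "378.0, 759.06, 456.0, 769.06"
NUM2_BBOX = "30.0, 431.06, 360.0, 441.06"


def scrape_page(pdf):
    """Return a one-row DataFrame for the page currently loaded in `pdf`."""
    num_1 = pdf.pq(f'LTTextBoxHorizontal:overlaps_bbox("{NUM1_BBOX}")').text()
    num_2 = pdf.pq(f'LTTextBoxHorizontal:overlaps_bbox("{NUM2_BBOX}")').text()
    # Return the frame; printing it alone would make this function return None.
    return pd.DataFrame({"num1": [num_1], "num2": [num_2]})


def scrape_all_pages(pdf, page_count):
    """Load each page in turn and stack the per-page rows into one DataFrame."""
    pages = []
    for p in range(page_count):
        pdf.load(p)  # assumption: loads only page p (zero-based)
        pages.append(scrape_page(pdf))
    if not pages:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame(columns=["num1", "num2"])
    # Concatenate once at the end rather than growing a frame inside the loop.
    return pd.concat(pages, ignore_index=True)
```

With the `pdf` object and `pagecount` from the question, `scrape_all_pages(pdf, pagecount).to_csv("output.csv", index=False)` should then write one row per page.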
Output: