PDF Scraping - All Objects Passed were None


I am attempting to create a simple PDF scraper using pandas and pdfquery. I want to take the data I need from each page of the PDF using the XML coordinates, put it into a DataFrame, and then save the DataFrame as a CSV file. I am having problems with the last part: I can get the data from a single PDF/page, but I can't get it to work over multiple pages. I'm a relative beginner with Python, so any help is appreciated.

import pdfquery
import pandas as pd
pdf = pdfquery.PDFQuery(r'path')
pdf.load()
pdf.tree.write('pdfXML.txt', pretty_print = True)
def pdfscrape(pdf):
    num_1 = pdf.pq('LTTextBoxHorizontal:overlaps_bbox("378.0, 759.06, 456.0, 769.06")').text()
    num_2 = pdf.pq('LTTextBoxHorizontal:overlaps_bbox("30.0, 431.06, 360.0, 441.06")').text()
    page = pd.DataFrame({ 'num1': num_1,'num2': num_2, },index=[0])
    print(page)
pagecount = pdf.doc.catalog['Pages'].resolve()['Count']
master = pd.DataFrame()
for p in range(pagecount):
    pdf.load(p)
    page = pdfscrape(pdf)
    master = master(pd.concat([page], ignore_index=True))
    master.to_csv("output.csv", index=False)

The result I expected was a csv file with the data points I needed from each page of the pdf. Instead I got:

Traceback (most recent call last):
    master = master(pd.concat([page], ignore_index=True))
line 380, in concat
    op = _Concatenator(
line 443, in __init__
    objs, keys = self._clean_keys_and_objs(objs, keys)
line 539, in _clean_keys_and_objs
    raise ValueError("All objects passed were None")
ValueError: All objects passed were None

There is 1 answer

Hermann12 On

You can load only the page you are interested in and then query it:

import pdfquery
import pandas as pd

def read_page(pdf):
    # Bounding box (x0, y0, x1, y1) of the text to extract
    bbox = (56.8, 771.397, 188.992, 783.397)
    # %s keeps the decimals; %d would truncate the coordinates to integers
    text1 = pdf.pq('LTTextLineHorizontal:overlaps_bbox("%s, %s, %s, %s")' % bbox).text()
    print(f"From function call:  {text1}\n")

pdf = pdfquery.PDFQuery('Doc_for_PDF.pdf')
pdf.load()  # load all pages for the dataframe
pdf.tree.write('pdfXML.xml', pretty_print=True)

df = pd.read_xml('pdfXML.xml', xpath='.//LTTextLineHorizontal')
#print(df.to_string())
print(df.head())
print()

# Load page by page here; pdf.load(i) rebuilds the tree from page i only
page_count = pdf.doc.catalog['Pages'].resolve()['Count']
for i in range(page_count):
    pdf.load(i)
    read_page(pdf)

Output:

        y0       y1  ...  word_margin                        LTTextBoxHorizontal
0  771.397  783.397  ...          0.1  This Text should be scrappt on first page
1  771.397  783.397  ...          0.1    This Text should be scrappt second page
2  729.997  741.997  ...          0.1                                   This not

[3 rows x 9 columns]

From function call:  This Text should be scrappt on first page

From function call:  This Text should be scrappt second page
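
As for the error in the question: pdfscrape() has no return statement, so it returns None, and pd.concat([None]) raises "All objects passed were None". In addition, master(...) tries to call a DataFrame as if it were a function, which would fail next. Below is a minimal sketch of how the per-page results could be collected and written to a CSV file. The names BOXES, scrape_page, scrape_pdf and scrape_to_csv are hypothetical helpers made up for this example, and the bounding boxes are simply the ones from the question, so treat it as a starting point rather than a tested solution for your PDF.

```python
import pandas as pd

# Bounding boxes (x0, y0, x1, y1) copied from the question; adjust to your PDF.
BOXES = {
    "num1": (378.0, 759.06, 456.0, 769.06),
    "num2": (30.0, 431.06, 360.0, 441.06),
}


def scrape_page(pdf):
    """Return a one-row DataFrame for the page currently loaded in `pdf`."""
    row = {}
    for name, bbox in BOXES.items():
        selector = 'LTTextBoxHorizontal:overlaps_bbox("%s, %s, %s, %s")' % bbox
        row[name] = pdf.pq(selector).text()
    # The explicit return is what the question's function was missing.
    return pd.DataFrame([row])


def scrape_pdf(pdf, page_count):
    """Load each page in turn and stack the per-page rows."""
    pages = []
    for page_number in range(page_count):
        pdf.load(page_number)  # load only this page
        pages.append(scrape_page(pdf))
    # Concatenate once, after the loop.
    return pd.concat(pages, ignore_index=True)


def scrape_to_csv(pdf_path, csv_path):
    """Scrape every page of the PDF at `pdf_path` and save the result as CSV."""
    import pdfquery  # imported here so the helpers above only need pandas

    pdf = pdfquery.PDFQuery(pdf_path)
    page_count = pdf.doc.catalog["Pages"].resolve()["Count"]
    master = scrape_pdf(pdf, page_count)
    master.to_csv(csv_path, index=False)  # write once, after all pages are read
    return master
```

Calling scrape_to_csv('Doc_for_PDF.pdf', 'output.csv') should then produce one CSV row per page. Collecting the pages in a list and concatenating once also avoids rewriting the CSV file on every loop iteration.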