Background
I'm using pdfquery to scrape data from PDFs, like this one. This question builds on my earlier question here.
I have successfully used custom wrapper functions that take arguments, as seen in this answer. However, the following code gives me trouble when I try to run it multiple times in a Jupyter notebook:
Cell 1
import pdfquery
def load_file(PDF_FILE):
    pdf = pdfquery.PDFQuery(PDF_FILE)
    pdf.load()
    return pdf

file_with_table = 'path_to_the_file_mentioned_above.pdf'
pdf = load_file(file_with_table)
Cell 2
def in_range(prop, bounds):
    def wrapped(*args, **kwargs):
        n = float(this.get(prop, 0))
        return bounds[0] <= n <= bounds[1]
    return wrapped

def is_element(element_type):
    def wrapped(*args, **kwargs):
        return this.tag in element_type
    return wrapped

def str_len(condition):
    def wrapped(*args, **kwargs):
        cond = ''.join([str(len(this.text)),condition])
        return eval(cond)
    return wrapped
Cell 3
x_check = in_range('x0', (97, 160))
y_check = in_range('y0', (250, 450))
el_check = is_element(['LTTextLineHorizontal', 'LTTextBoxHorizontal'])
str_len = str_len('>0')
els = pdf.pq('LTPage[page_index="0"] *').filter(el_check)
els = els.filter(str_len)
els = els.filter(x_check)
els = els.filter(y_check)
[(i.text) for i in els]
The function str_len works fine if it is run a single time after definition:
[Screenshot: no error when running the third cell]
but it throws a NameError when I try to run the function a second time:
[Screenshot: NameError after running the third cell a second time]
Here is the text of the NameError:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-27-54cd329bb1e1> in <module>()
      2 y_check = in_range('y0', (250, 450))
      3 el_check = is_element(['LTTextLineHorizontal', 'LTTextBoxHorizontal'])
----> 4 str_len = str_len('>0')
      5
      6 els = pdf.pq('LTPage[page_index="0"] *').filter(el_check)

<ipython-input-25-654bff7d0eed> in wrapped(*args, **kwargs)
     12 def str_len(condition):
     13     def wrapped(*args, **kwargs):
---> 14         return eval(''.join([str(len(this.text)),condition]))
     15     return wrapped

NameError: name 'this' is not defined
Questions
Why can I only use this function once after its definition?
Is there any way that I can circumvent this problem?
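Function names are variables like any other; there isn't a separate namespace for functions.
The line str_len = str_len('>0') rebinds the name str_len to the return value of the call to the original value of str_len. After this line, you no longer have a reference to the original function; str_len now refers to the inner function wrapped. Running the cell a second time therefore calls wrapped('>0') directly, outside of a filter call, where this is not defined, hence the NameError. Use a different name for the returned filter: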
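As a minimal sketch of the rebinding and its fix, the snippet below reproduces the factory without pdfquery. It assumes the text is passed to wrapped explicitly instead of being read from this, and the name str_len_check is only an illustrative choice, not something from the original code.

```python
def str_len(condition):
    """Factory: return a predicate comparing a text's length against `condition`."""
    def wrapped(text):
        # Assumption: `text` is passed in explicitly here; in the question's
        # version the value is read from `this.text` during a filter() call.
        return eval(str(len(text)) + condition)
    return wrapped

# Fix: bind the returned predicate to a different (hypothetical) name,
# so `str_len` keeps referring to the factory and the cell can be re-run.
str_len_check = str_len('>0')
str_len_check = str_len('>0')   # a second run behaves the same way

print(str_len_check('abc'))     # True
print(str_len_check(''))        # False

# The original problem: reusing the name replaces the factory with the predicate.
str_len = str_len('>0')
print(str_len.__name__)         # wrapped
```

With the separate name, the third cell would use els.filter(str_len_check) and could be executed any number of times without redefining str_len.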