Former title: "TypeError: PageObject.extract_text() got an unexpected keyword argument 'visitor_text'"
I am trying to follow the example in the pypdf documentation on using a visitor to extract text.
Environment
environment.yml file contents, slightly changed:
name: C:\Users\[my_username]\src\repos\read-pdf\envs\read-pdf-env
channels:
- conda-forge
- defaults
dependencies:
- asttokens=2.0.5=pyhd3eb1b0_0
- beautifulsoup4=4.12.2=py312haa95532_0
- brotli-python=1.0.9=py312hd77b12b_7
- bzip2=1.0.8=he774522_0
- ca-certificates=2023.12.12=haa95532_0
- certifi=2023.11.17=py312haa95532_0
- cffi=1.16.0=py312h2bbff1b_0
- charset-normalizer=2.0.4=pyhd3eb1b0_0
- colorama=0.4.6=py312haa95532_0
- comm=0.1.2=py312haa95532_0
- cryptography=41.0.7=py312h89fc84f_0
- debugpy=1.6.7=py312hd77b12b_0
- decorator=5.1.1=pyhd3eb1b0_0
- defusedxml=0.7.1=pyhd3eb1b0_0
- executing=0.8.3=pyhd3eb1b0_0
- expat=2.5.0=hd77b12b_0
- fpdf=1.7.2=pyhd8ed1ab_0
- fpdf2=2.5.6=pyhd8ed1ab_0
- freetype=2.12.1=ha860e81_0
- giflib=5.2.1=h8cc25b3_3
- idna=3.4=py312haa95532_0
- ipykernel=6.29.0=pyha63f2e9_0
- ipython=8.20.0=py312haa95532_0
- jedi=0.18.1=py312haa95532_1
- jpeg=9e=h2bbff1b_1
- jupyter_client=8.6.0=py312haa95532_0
- jupyter_core=5.5.0=py312haa95532_0
- lerc=3.0=hd77b12b_0
- libdeflate=1.17=h2bbff1b_1
- libffi=3.4.4=hd77b12b_0
- libpng=1.6.39=h8cc25b3_0
- libsodium=1.0.18=h62dcd97_0
- libtiff=4.5.1=hd77b12b_0
- libwebp=1.3.2=hbc33d0d_0
- libwebp-base=1.3.2=h2bbff1b_0
- lz4-c=1.9.4=h2bbff1b_0
- matplotlib-inline=0.1.6=py312haa95532_0
- nest-asyncio=1.5.6=py312haa95532_0
- openjpeg=2.4.0=h4fc8c34_0
- openssl=3.0.12=h2bbff1b_0
- packaging=23.1=py312haa95532_0
- parso=0.8.3=pyhd3eb1b0_0
- pdfminer=20191125=pyhd8ed1ab_1
- pdfminer.six=20231228=pyhd8ed1ab_0
- pillow=10.0.1=py312h045eedc_0
- pip=23.3.1=py312haa95532_0
- platformdirs=3.10.0=py312haa95532_0
- prompt-toolkit=3.0.43=py312haa95532_0
- prompt_toolkit=3.0.43=hd3eb1b0_0
- psutil=5.9.0=py312h2bbff1b_0
- pure_eval=0.2.2=pyhd3eb1b0_0
- pycparser=2.21=pyhd3eb1b0_0
- pycryptodome=3.15.0=py312h2bbff1b_0
- pygments=2.15.1=py312haa95532_1
- pyopenssl=23.2.0=py312haa95532_0
- pypdf=4.0.0=pyhd8ed1ab_0
- pypdf2=2.10.5=py312haa95532_0
- pysocks=1.7.1=py312haa95532_0
- python=3.12.0=h1d929f7_0
- python-dateutil=2.8.2=pyhd3eb1b0_0
- pywin32=305=py312h2bbff1b_0
- pyzmq=25.1.2=py312hd77b12b_0
- requests=2.31.0=py312haa95532_0
- setuptools=68.2.2=py312haa95532_0
- six=1.16.0=pyhd3eb1b0_1
- soupsieve=2.5=py312haa95532_0
- sqlite=3.41.2=h2bbff1b_0
- stack_data=0.2.0=pyhd3eb1b0_0
- svg.path=6.3=pyhd8ed1ab_0
- tk=8.6.12=h2bbff1b_0
- tornado=6.3.3=py312h2bbff1b_0
- traitlets=5.7.1=py312haa95532_0
- typing_extensions=4.9.0=py312haa95532_1
- tzdata=2023d=h04d1e81_0
- urllib3=1.26.18=py312haa95532_0
- vc=14.2=h21ff451_1
- vs2015_runtime=14.27.29016=h5e58377_2
- wcwidth=0.2.5=pyhd3eb1b0_0
- wheel=0.41.2=py312haa95532_0
- win_inet_pton=1.1.0=py312haa95532_0
- xz=5.4.5=h8cc25b3_0
- zeromq=4.3.5=hd77b12b_0
- zlib=1.2.13=h8cc25b3_0
- zstd=1.5.5=hd43e919_0
prefix: C:\Users\[my_username]\src\repos\read-pdf\envs\read-pdf-env
VS Code Integrated Terminal
$ C:\Users\[my_username]\src\repos\read-pdf\envs\read-pdf-env\python.exe -m platform
Windows-11-10.0.22621-SP0
$ C:\Users\[my_username]\src\repos\read-pdf\envs\read-pdf-env\python.exe -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.0, crypt_provider=('cryptography', '41.0.7'), PIL=10.0.1
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader

filepath = r"C:\Users\[my_username]\src\repos\read-pdf\data\raw\GeoBase_NHNC1_Data_Model_UML_EN.pdf"
reader = PdfReader(filepath)
page = reader.pages[3]

parts = []

def visitor_body(text, cm, tm, font_dict, font_size):
    y = cm[5]
    if y > 50 and y < 720:
        parts.append(text)

page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)
print(text_body)
Sharing here the PDF file that I used.
Traceback
There is no traceback and no visible output. The only trace of output is a single newline, which appears when I copy the cell output from VS Code's Jupyter extension and paste it into Notepad.
In a previous version of this question, a traceback was produced. That is no longer the case.
I have tried changing the page number by editing the hard-coded index in this line:
page = reader.pages[3]
The tutorial does not say whether any output is expected, but page 4 is the table of contents.
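One way to check what the visitor actually receives is to record both vertical offsets for each text fragment instead of filtering on them. This is only a rough sketch: make_recording_visitor is a hypothetical helper name, not part of pypdf, and it assumes cm and tm arrive as six-element sequences, as in the documentation's example.

```python
def make_recording_visitor(rows):
    """Return a visitor that records (text, cm_y, tm_y) for each non-blank fragment.

    Hypothetical helper for debugging; not part of pypdf.
    """
    def visitor(text, cm, tm, font_dict, font_size):
        # Skip fragments that contain only whitespace.
        if text.strip():
            rows.append((text, cm[5], tm[5]))
    return visitor


# Usage with the page object from the code above:
# rows = []
# page.extract_text(visitor_text=make_recording_visitor(rows))
# for text, cm_y, tm_y in rows[:10]:
#     print(repr(text), cm_y, tm_y)
```

If the recorded cm_y values all fall outside the 50 to 720 range while tm_y varies across the page, the filter on cm[5] rejects every fragment, which would explain the empty output.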
It's a documentation issue
They should've used text_matrix (tm) instead of current_matrix (cm) in the documentation. Just change y = cm[5] to y = tm[5] and the code will work.
Here's the same code but modified:
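A minimal sketch of that change, assuming pypdf is installed and that the visitor receives the text matrix as its third argument, as in the documentation's signature. The helper names make_body_visitor and extract_body are hypothetical and not part of pypdf. The only substantive change from the question's code is reading y from tm[5] instead of cm[5].

```python
def make_body_visitor(parts, y_min=50, y_max=720):
    """Return a visitor that keeps text whose text-matrix y lies between y_min and y_max.

    Hypothetical helper; the bounds are the ones used in the question.
    """
    def visitor_body(text, cm, tm, font_dict, font_size):
        y = tm[5]  # vertical offset from the text matrix (was cm[5])
        if y_min < y < y_max:
            parts.append(text)
    return visitor_body


def extract_body(pdf_path, page_index=3):
    """Extract the filtered text of one page (hypothetical wrapper)."""
    # Imported here so the filter above can be tried without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    parts = []
    reader.pages[page_index].extract_text(visitor_text=make_body_visitor(parts))
    return "".join(parts)


# Usage, with filepath as defined in the question:
# print(extract_body(filepath))
```

Wrapping the filter in a factory keeps the parts list local rather than global, and lets the y range be adjusted per document.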