Parsing a remote PDF file with Python 3 & PyPDF2


I need to parse a remote PDF file. With PyPDF2, a PDF is opened with PdfReader(f), so I tried f = urllib.request.urlopen("some-url").read(). However, PdfReader cannot use f as it is, and it seems that f has to be decoded first. What argument should I pass to decode(), or is there some other method I should use?

There are 2 answers

1
Nitin Bhojwani On

You need to use:

f = urllib.request.urlopen("some-url").read()

Then add these lines after it. On Python 3, read() returns bytes, so use io.BytesIO (the StringIO module exists only in Python 2):

from io import BytesIO

f = BytesIO(f)

and then read using PdfReader as:

reader = PdfReader(f)

See also: Opening pdf urls with pyPdf
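To illustrate why the raw result of read() is rejected while a wrapped one is accepted, here is a minimal sketch. The sample bytes are a made-up placeholder, not a real PDF. It relies on the assumption that a PDF parser needs to seek within its input (the cross-reference data sits near the end of the file), which a plain bytes object does not support.

```python
from io import BytesIO

# Placeholder bytes standing in for the result of urlopen(...).read()
raw = b"%PDF-1.4\n"

# A bytes object is just data; it has no file-like seek()/tell() methods
raw_is_seekable = hasattr(raw, "seek")

# BytesIO wraps the same data in an in-memory, seekable file-like object
stream = BytesIO(raw)
stream.seek(0, 2)            # move to the end of the stream
end_position = stream.tell() # position now equals the number of bytes

print(raw_is_seekable, end_position == len(raw))
```

No decode() call is involved at any point: the data stays binary, and only the container around it changes.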

2
celsowm On

There is nothing to decode; wrap the downloaded bytes in BytesIO so the reader gets a file-like object:

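import urllib.request  # plain "import urllib" does not reliably load urllib.request
import PyPDF2
from io import BytesIO
f = urllib.request.urlopen("https://mypdf.pdf").read()  # raw bytes of the PDF
pdf_bytes = BytesIO(f)  # seekable file-like wrapper around the bytes
pdf_reader = PyPDF2.PdfReader(pdf_bytes)  # PdfFileReader is the older, deprecated name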
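The approach can be wrapped in a small helper, sketched below. The function names fetch_pdf_stream and open_remote_pdf, the timeout value, and the example URL are assumptions made for illustration; they are not part of PyPDF2. PyPDF2 is imported inside the second function, so the download step works on its own with only the standard library.

```python
import urllib.request
from io import BytesIO


def fetch_pdf_stream(url, timeout=30):
    """Download url and return its content as a seekable in-memory stream.

    Hypothetical helper; the 30-second timeout is an arbitrary choice.
    """
    # The context manager closes the connection once the body has been read
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return BytesIO(response.read())


def open_remote_pdf(url):
    """Return a PyPDF2 PdfReader for the PDF at url (hypothetical helper)."""
    # Imported here so fetch_pdf_stream stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader

    return PdfReader(fetch_pdf_stream(url))
```

Usage would look like reader = open_remote_pdf("https://example.com/file.pdf"), followed by len(reader.pages) to count the pages. Keeping the whole file in memory is fine for typical documents; for very large PDFs, writing to a temporary file may be a better fit.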