UnicodeEncodeError while extracting text from a PDF using pypdf


I'm trying to extract text from a PDF document using the following Python code:

from pypdf import PdfReader

pdf_path = 'Final RFP N00024-20-R-5500 2020-04-24 SPY-6.pdf'
reader = PdfReader(pdf_path)
for page in range(len(reader.pages)):
    print(reader.pages[page].extract_text(extraction_mode="layout"))

It works perfectly until it reaches a page containing a character that makes the print statement above raise the following error:

File "C:\Programs\Python\Python312\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 2803: character maps to <undefined>

Question: is there a way to either ignore the error or, ideally, replace the offending character with a default character such as �?

I've tried changing the code to:

myText = reader.pages[page].extract_text(extraction_mode="layout")
print(myText.encode("utf-8"))

That no longer raises the error, but all of the layout formatting is lost.
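For context, the layout is probably lost because `print()` on a `bytes` object shows its repr (`b'...'`), with every newline displayed as a literal `\n`. A possible workaround, sketched below, is to round-trip the string through the console encoding with `errors="replace"`, so the result is still a `str` and line breaks and spacing are kept. The helper name `printable` and the sample string are hypothetical; the `cp1252` default is an assumption taken from the traceback.

```python
def printable(text: str, encoding: str = "cp1252") -> str:
    """Return text with characters the encoding cannot represent replaced by '?'.

    Hypothetical helper, not part of pypdf. The default encoding is an
    assumption based on the cp1252 codec named in the traceback.
    """
    # Encoding with errors="replace" substitutes '?' for anything unencodable;
    # decoding again yields a str, so newlines and spaces print as layout.
    return text.encode(encoding, errors="replace").decode(encoding)


# Made-up sample standing in for one page of extract_text() output.
sample = "Item\n  \uf0b7 first bullet\n  \uf0b7 second bullet"
print(printable(sample))
```

In the loop from the question, this would be `print(printable(reader.pages[page].extract_text(extraction_mode="layout")))`. Passing `sys.stdout.encoding` instead of the hard-coded default should adapt it to the actual console. An alternative that avoids touching the string at all is calling `sys.stdout.reconfigure(errors="replace")` once before printing (Python 3.7+).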
