Decoding problem with fitz.Document in Python 3.7


I want to extract the text of a PDF document and use some regular expressions to filter for information.

I am coding in Python 3.7.4, using fitz (PyMuPDF) to parse the PDF document. The PDF document is written in German. My code looks as follows:

doc = fitz.open(pdfpath)
pagecount = doc.pageCount
page = 0
content = ""

while (page < pagecount):
    p = doc.loadPage(page)
    page += 1
    content = content + p.getText()
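For context, the regex filtering step mentioned above would run over `content` roughly like this (a minimal sketch; the sample text and the date pattern are only illustrative, not from the actual document):

```python
import re

# Hypothetical sample of extracted text; in practice this would be
# the `content` string built up in the loop above.
content = "Rechnung vom 12.03.2019, Betrag: 199,99 EUR"

# Pull a German-style date (DD.MM.YYYY) out of the extracted text.
date_match = re.search(r"\b(\d{2})\.(\d{2})\.(\d{4})\b", content)
if date_match:
    print(date_match.group(0))  # → 12.03.2019
```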

Printing the content, I realized that the first (and important) half of the document is decoded as a strange mix of what look like Japanese characters and other symbols, like this: ョ。オウキ・ゥエオョァ@ュ.

I tried to fix it with different decodings (latin-1 and iso-8859-1), but the text is definitely UTF-8:

content = content + p.getText().encode("utf-8").decode("utf-8")
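To verify this is not just a display problem, the code points of the garbled text can be inspected directly (a minimal diagnostic sketch; the sample string is copied from the output above). An `.encode("utf-8").decode("utf-8")` round trip is a no-op on a Python 3 `str`, so if these really are katakana code points, the damage likely happened inside the PDF (e.g. a wrong or missing ToUnicode mapping for the font), not on the Python side:

```python
import unicodedata

# A sample of the garbled fitz output from above.
garbled = "ョ。オウキ"

# Print the Unicode code point and official name of each character.
# If these come out as genuine KATAKANA names, fitz is already
# returning the wrong code points and no re-decoding can repair them.
for ch in garbled:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
```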

I also have tried to get the text using minecart:

import minecart

file = open(pdfpath, 'rb')

document = minecart.Document(file)

for page in document.iter_pages():
    for lettering in page.letterings:
        print(lettering)

which results in the same problem.

Using textract, the first half is an empty string:

import textract

text = textract.process(pdfpath)
print(text.decode('utf-8'))

Same thing with PyPDF2:

import PyPDF2


pdfFileObj = open(pdfpath, 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
for index in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(index)
    print(pageObj.extractText())

I don't understand the problem, as it looks like a normal PDF document with normal text. Some of my other PDF documents don't have this problem at all.
