I have a docx file created by FineReader (FineReader takes a PDF file and creates a text file containing the text from the PDF). Now I want to read the file using Python 3. I used:
with open('file.docx') as file:
    for line in file:
        print(line)
and this was the outcome:
at \'ed og \'fd \'fattalast uj; at \'f3 \'fattalast sum ou og \'fa sum yv, ei sum aj, at g og k hava
This is what the text looks like in the docx file when I open it in a text editor:
at í og ý úttalast uj; at ó úttalast sum ou og ú sum yv, ei sum aj, at g og k hava
Any idea how to fix this encoding issue?
Btw, I also tried the following:
with open('file.docx', encoding='latin-1') as file:
    for line in file:
        l = line.encode("utf-8")
        print(l)
and this was the outcome:
at \\'ed og \\'fd \\'fattalast uj; at \\'f3 \\'fattalast sum ou og \\'fa sum yv, ei sum aj, at g og k hava
And I tried this:
import unicodedata
with open('file.docx') as file:
    for line in file:
        unicode = unicodedata.normalize('NFC', line)
        print(unicode)
and got:
at \'ed og \'fd \'fattalast uj; at \'f3 \'fattalast sum ou og \'fa sum yv, ei sum aj, at g og k hava
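For reference, the \'ed-style sequences in the output look like RTF hex escapes rather than a Python decoding error, which would suggest the file is really RTF despite its .docx extension. That is only a guess from the output above. A minimal sketch of decoding such escapes by hand, assuming they are RTF hex escapes and assuming the Windows-1252 code page (neither is confirmed here); decode_rtf_hex_escapes is a hypothetical helper name, not part of any library:

```python
import re

# Matches escapes of the form backslash, apostrophe, two hex digits (e.g. \'ed).
_HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_rtf_hex_escapes(text, codepage="cp1252"):
    # Hypothetical helper: replace each hex escape with the character that
    # the byte encodes in `codepage`. The code page is an assumption; a real
    # RTF file declares its own code page in its header.
    def _replace(match):
        byte = bytes([int(match.group(1), 16)])
        # Bytes that are undefined in the code page become U+FFFD.
        return byte.decode(codepage, errors="replace")

    return _HEX_ESCAPE.sub(_replace, text)


line = r"at \'ed og \'fd \'fattalast uj; at \'f3 \'fattalast sum ou og \'fa sum yv"
print(decode_rtf_hex_escapes(line))
# prints: at í og ý úttalast uj; at ó úttalast sum ou og ú sum yv
```

This only handles the hex escapes; any other RTF control words in the file would be left untouched.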
The problem was solved by opening the document with the following: