docx file to python in correct format


I have a docx file created by FineReader (FineReader takes a PDF file and creates a text file containing the text from the PDF). Now I want to read the file using Python 3. I used:

with open('file.docx') as file:
    for line in file:
        print(line)

and this was the outcome:

at \'ed og \'fd \'fattalast uj; at \'f3 \'fattalast sum ou og \'fa sum yv, ei sum aj, at g og k hava

This is what the text looks like in the docx file when I open it in a text editor:

at í og ý úttalast uj; at ó úttalast sum ou og ú sum yv, ei sum aj, at g og k hava

Any idea how to fix this encoding issue?


By the way, I also tried the following:

with open('file.docx', encoding='latin-1') as file:
    for line in file:
        l = line.encode("utf-8")
        print(l)

and this was the outcome:

at \\'ed og \\'fd \\'fattalast uj; at \\'f3 \\'fattalast sum ou og \\'fa sum yv, ei sum aj, at g og k hava

And I tried this:

import unicodedata
with open('file.docx') as file:
    for line in file:
        unicode = unicodedata.normalize('NFC', line)
        print(unicode)

and got:

at \'ed og \'fd \'fattalast uj; at \'f3 \'fattalast sum ou og \'fa sum yv, ei sum aj, at g og k hava
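The `\'ed`-style sequences in the outputs above have the shape of RTF hex escapes (a backslash, an apostrophe, and two hex digits naming a byte in the document's code page). That would suggest, as an assumption rather than something confirmed here, that the text being read is RTF markup rather than plain text. A minimal sketch of decoding such escapes follows; the helper name `decode_rtf_hex_escapes` and the choice of `cp1252` as the code page are illustrative assumptions, and a real RTF file contains much more markup than this handles.

```python
import re

# Matches RTF-style hex escapes: backslash, apostrophe, two hex digits.
_RTF_HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_rtf_hex_escapes(text, codepage="cp1252"):
    """Replace each hex escape with the character that byte encodes.

    `codepage` is an assumption: cp1252 is common for Western European
    text, but the actual code page depends on the file.
    """
    def _replace(match):
        raw = bytes.fromhex(match.group(1))
        # Bytes that are undefined in the code page become U+FFFD.
        return raw.decode(codepage, errors="replace")

    return _RTF_HEX_ESCAPE.sub(_replace, text)


line = r"at \'ed og \'fd \'fattalast uj; at \'f3 \'fattalast sum ou og \'fa sum yv"
print(decode_rtf_hex_escapes(line))
```

This only illustrates what the escapes mean; it is not a substitute for a proper RTF or docx parser.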

There is 1 answer

DevB2F On

The problem was solved by opening the document with the python-docx package instead of the built-in open():

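import unicodedata

import docx  # provided by the python-docx package

# python-docx parses the document and exposes each paragraph's text as a str
doc = docx.Document("file.docx")
for p in doc.paragraphs:
    # NFC composes base letter + combining accent pairs into single characters
    text = unicodedata.normalize('NFC', p.text)
    print(text)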
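For background on why a dedicated library helps: a genuine .docx file is a ZIP archive whose main text is stored as XML in the member `word/document.xml`, so reading it line by line with `open()` cannot return the visible text directly. The sketch below uses only the standard library to show that structure. The function name `read_docx_paragraphs` is hypothetical, and the code is a simplification: it collects every `w:p` element, including paragraphs inside tables, and ignores headers, footers, and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(path):
    """Return the text of each paragraph found in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # The visible text sits in w:t elements inside the paragraph's runs.
        pieces = [t.text or "" for t in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Calling `read_docx_paragraphs("file.docx")` returns a list of strings that can be printed one per line. For anything beyond a quick inspection, python-docx as used above is the more robust choice.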