docx file to python in correct format

Question

2.4k views Asked by DevB2F At 25 July 2018 at 18:17

I have a docx file created by FineReader (FineReader takes a PDF file and creates a text file with the text from the PDF). Now I want to read the file using Python 3. I used:

with open('file.docx') as file:
    for line in file:
        print(line)

and this was the outcome:

at \'ed og \'fd \'fattalast uj; at \'f3 \'fattalast sum ou og \'fa sum yv, ei sum aj, at g og k hava

This is what the text looks like in the docx file when I open it in a text editor:

at í og ý úttalast uj; at ó úttalast sum ou og ú sum yv, ei sum aj, at g og k hava

Any idea how to fix this encoding issue?

By the way, I also tried the following:

with open('file.docx', encoding='latin-1') as file:
    for line in file:
        l = line.encode("utf-8")
        print(l)

and this was the outcome:

at \\'ed og \\'fd \\'fattalast uj; at \\'f3 \\'fattalast sum ou og \\'fa sum yv, ei sum aj, at g og k hava

And I tried this:

import unicodedata
with open('file.docx') as file:
    for line in file:
        unicode = unicodedata.normalize('NFC', line)
        print(unicode)

and got:

at \'ed og \'fd \'fattalast uj; at \'f3 \'fattalast sum ou og \'fa sum yv, ei sum aj, at g og k hava
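
A note on why the last attempt changes nothing: `unicodedata.normalize` only recomposes or decomposes characters that are already Unicode text. It does not interpret escape sequences such as `\'ed`, which consist entirely of plain ASCII characters. The short sketch below illustrates the difference; the variable names are made up for this example.

```python
import unicodedata

# Literal backslash, apostrophe and two hex digits: all plain ASCII.
escaped = r"at \'ed og \'fd"

# 'i' followed by COMBINING ACUTE ACCENT (U+0301): a decomposed form of í.
decomposed = "i\u0301"

# NFC has nothing to compose in the escaped text, so it comes back unchanged.
print(unicodedata.normalize("NFC", escaped) == escaped)

# NFC does merge the base letter and the combining accent into U+00ED.
print(unicodedata.normalize("NFC", decomposed) == "\u00ed")
```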

There is 1 answer

**DevB2F** · Answer 1 · 2018-07-25T18:42:40+00:00

The problem was solved by opening the document with the python-docx package instead of the built-in open():

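import unicodedata
import docx  # provided by the python-docx package

# Parse the .docx container rather than reading the file as plain text
doc = docx.Document("file.docx")
for p in doc.paragraphs:
    # p.text is already a decoded str; NFC composes any combining accents
    text = unicodedata.normalize('NFC', p.text)
    print(text)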
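
For readers who cannot use a document parser, the escapes in the question's output can also be decoded by hand. The sequences `\'ed`, `\'fd`, `\'fa` and `\'f3` line up with í, ý, ú and ó if each pair of hex digits is read as one byte in a Windows-1252 style code page. That reading is an assumption inferred from the sample text, not something the question confirms. The helper name `decode_hex_escapes` and its `codepage` parameter are hypothetical, and this is only a minimal sketch, not a full parser for any file format.

```python
import re

# Matches a backslash, an apostrophe, then exactly two hex digits, e.g. \'ed
_HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_hex_escapes(text, codepage="cp1252"):
    """Replace each \\'xx escape with the character for byte 0xXX.

    Assumption: the escapes are single bytes in `codepage` (cp1252 here).
    Bytes that are undefined in the code page become U+FFFD.
    """
    def _replace(match):
        byte_value = int(match.group(1), 16)
        return bytes([byte_value]).decode(codepage, errors="replace")

    return _HEX_ESCAPE.sub(_replace, text)


# Part of the line from the question, as a raw string to keep the backslashes.
sample = r"at \'ed og \'fd \'fattalast uj; at \'f3 \'fattalast sum ou"
print(decode_hex_escapes(sample))
```

Text without any escapes passes through unchanged, so the helper can be applied line by line to a whole file.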
