Extracting text from Open Document file generates UnicodeEncodeError

I'm trying to convert the notes attached to an Open Document Presentation file to text using odfpy. I can open the file, build a list of 'notes' objects, and extract what I believe are paragraphs from them. This works until I try to print notes containing special characters (German umlauts öäü), which causes an error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 17-19: ordinal not in range(128)

I gather that I'm not the first to run into an encoding problem, and I'd happily dive into re-encoding the text. My problem is that I don't know how to convert the notes to proper strings in the first place. Here is my code:

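import sys
from odf.presentation import Notes
from odf.opendocument import load
from odf import text

# Load the presentation given on the command line and collect its notes pages
doc = load(sys.argv[1])
slides = doc.presentation
notes = slides.getElementsByType(Notes)

# Print every paragraph of every notes page (Python 2 print statement)
for page in notes:
    pars = page.getElementsByType(text.P)
    for p in pars:
        print p  # raises UnicodeEncodeError for non-ASCII text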
I simply iterate over the elements and try to print them, hoping that magically the text from the notes will appear. I have deposited a sample presentation file at https://spideroak.com/browse/share/enno_middelberg/public/public to illustrate the issue.

Can anyone enlighten me how to get the text out of the ODF elements and into a string?

Many thanks,

Enno

There is 1 answer

alexanderlukanin13 (best answer)

print p implicitly calls str(p), which fails because p contains non-ASCII text and Python 2 tries to encode it with the 'ascii' codec.
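
The failure can be reproduced without odfpy. The following is a minimal sketch, not taken from the question: note_text is a made-up sample string and encode_ascii is a hypothetical helper. It is written so that it also runs under Python 3, and it triggers the same kind of error by encoding umlauts with the ASCII codec explicitly.

```python
# Hypothetical sample text standing in for a note paragraph
# ("Gruesse aus Koeln" written with real umlauts and sharp s).
note_text = u"Gr\u00fc\u00dfe aus K\u00f6ln"


def encode_ascii(text):
    """Try an ASCII encode; return the error message instead of raising."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)


# The ASCII codec has no representation for the umlauts, so this yields
# the "ordinal not in range(128)" message rather than encoded bytes.
message = encode_ascii(note_text)
print(message)
```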

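Use print unicode(p) instead, which converts the paragraph to a unicode string without the implicit ASCII encoding step.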
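
If the output is piped or redirected, the terminal encoding may not be available, so it can be more predictable to encode explicitly. The sketch below assumes UTF-8 is the desired output encoding. FakeParagraph is a hypothetical stand-in for an odfpy paragraph element (it is not the real class) and write_notes is an illustrative helper, so the example runs without the library. It uses Python 3 syntax, where str(p) plays the role of unicode(p).

```python
import io


class FakeParagraph(object):
    """Hypothetical stand-in for an odfpy text.P element (not the real class)."""

    def __init__(self, content):
        self.content = content

    def __str__(self):
        return self.content


def write_notes(paragraphs, stream):
    """Convert each paragraph to text and write it UTF-8 encoded, one per line."""
    for p in paragraphs:
        line = str(p) + "\n"  # under Python 2 this would be unicode(p)
        stream.write(line.encode("utf-8"))


# Collect the encoded bytes in memory; a binary file or pipe works the same way.
buf = io.BytesIO()
write_notes([FakeParagraph(u"\u00f6\u00e4\u00fc")], buf)
data = buf.getvalue()
```

With real odfpy elements, the same loop would take the paragraphs returned by getElementsByType(text.P) in place of the FakeParagraph list.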