lxml - keep input symbols, disable entity conversion


If the following string is parsed and serialized again (shown here with xml.etree.ElementTree; lxml behaves the same way), the umlauts are converted to numeric character references.

import xml.etree.ElementTree as ET

root = ET.fromstring("<r><s>Die Häuser haben Dächer.</s></r>")
as_text = ET.tostring(root).decode("utf-8")
print(as_text)

Output:

<r><s>Die H&#228;user haben D&#228;cher.</s></r>

Expected output:

<r><s>Die Häuser haben Dächer.</s></r>

The umlauts are just an example. I generally want to disable entity conversions and instead keep the raw input symbols.

Can I disable entity conversion? Is there a safe method to reconvert the entities?

Answer by mzjn (best answer)

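The default encoding used by tostring() is ASCII in both ElementTree and lxml, so any character outside the ASCII range is written as a numeric character reference such as &#228;.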

To get the expected output, you can use encoding="unicode", which returns a str (not bytes) with the original characters kept:

as_text = ET.tostring(root, encoding="unicode")
print(as_text)
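
As a minimal sketch using only the standard-library ElementTree (the variable names are illustrative, not from the question), the following compares the default ASCII serialization with encoding="unicode" and encoding="utf-8". It also touches on the "reconvert" part of the question: the character references are plain XML, so parsing the ASCII output again yields the original text.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<r><s>Die Häuser haben Dächer.</s></r>")

# Default (ASCII): returns bytes; non-ASCII characters become
# numeric character references such as &#228;
ascii_bytes = ET.tostring(root)

# encoding="unicode": returns a str with the characters kept as-is
as_str = ET.tostring(root, encoding="unicode")

# encoding="utf-8": returns UTF-8 encoded bytes; decode to get text
as_text = ET.tostring(root, encoding="utf-8").decode("utf-8")

# "Reconverting": any XML parser resolves the references again,
# so re-parsing the ASCII output gives back the original characters
reparsed_text = ET.fromstring(ascii_bytes).find("s").text

print(ascii_bytes)
print(as_str)
print(reparsed_text)
```

Whether to use "unicode" or "utf-8" mostly depends on whether the result should be a str for further text processing or bytes to write to a file.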
