If the following string is parsed and written back out using ElementTree (the same happens with lxml), the umlauts are converted to entities.
import xml.etree.ElementTree as ET
root = ET.fromstring("<r><s>Die Häuser haben Dächer.</s></r>")
as_text = ET.tostring(root).decode("utf-8")
print(as_text)
Output:
<r><s>Die H&#228;user haben D&#228;cher.</s></r>
Expected output:
<r><s>Die Häuser haben Dächer.</s></r>
The umlauts are just an example. I generally want to disable entity conversions and instead keep the raw input symbols.
Can I disable entity conversion? Is there a safe method to reconvert the entities?
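The default encoding used by tostring() is ASCII in both ElementTree and lxml, so every non-ASCII character is written as a numeric character reference. To get the expected output, you can pass encoding="unicode", which makes tostring() return a str with the characters left intact.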
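A minimal sketch of the difference, using the standard-library ElementTree and the sample string from the question. The variable names (ascii_bytes, as_text, reparsed_text) are only illustrative. The last step also suggests that the character references are harmless: parsing the ASCII output again should give back the original characters.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<r><s>Die Häuser haben Dächer.</s></r>")

# Default: tostring() returns ASCII-encoded bytes, so non-ASCII
# characters are written as numeric character references.
ascii_bytes = ET.tostring(root)

# encoding="unicode" returns a str and leaves the characters untouched.
as_text = ET.tostring(root, encoding="unicode")
print(as_text)  # <r><s>Die Häuser haben Dächer.</s></r>

# The references are plain XML: parsing the ASCII output again
# restores the original text.
reparsed_text = ET.fromstring(ascii_bytes).find("s").text
print(reparsed_text)  # Die Häuser haben Dächer.
```

With encoding="unicode" the result is already a str, so the .decode("utf-8") call from the question is no longer needed.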