Preserve a newline character (\n) in an attribute value when parsing


I am parsing an XML document with the lxml library. One attribute value contains a newline character (\n):

from lxml import etree

root = etree.fromstring('<root attr1="line1\nline2"/>')
print(etree.tostring(root).decode())

Result:

<root attr1="line1 line2"/>

That is, the parser replaces the newline character with a space. Is there any way to leave the newline character in the attribute value when parsing?

I know a newline character can be added when building the XML programmatically:

from lxml import etree

root = etree.Element('root', attr1='line1\nline2')
print(root.attrib['attr1'])
print(etree.tostring(root).decode())

Result:

line1
line2
<root attr1="line1&#10;line2"/>

But how can I do this when parsing?

Update

The behaviour seems to depend on the OS. The problem described above occurs on Windows; when I ran the same example on Linux, the newline characters appeared to be preserved. So the remaining question is: is there a way to disable the conversion of newline characters to spaces on Windows?


There are 3 answers

1
guegouoguiddel On

I think the below can help:

from lxml import etree, objectify

root = objectify.fromstring('<root attr1="line1\nline2"/>')
attr1_value = root.get('attr1')
print(attr1_value)  # Prints the value as parsed; a literal newline may already have been normalized to a space
0
aneroid On

You'll need to handle this manually; see "How to save newlines in XML attribute?"

If it's in the incoming data, it should already be correctly encoded. In your own example where you add the \n, see the output of etree.tostring(root).decode():

from lxml import etree

root = etree.Element('root', attr1='line1\nline2')
print(etree.tostring(root).decode())
# Output: <root attr1="line1&#10;line2"/>

Note the &#10; above: the serializer substituted it for \n automatically.

So if you want the same behaviour but you're constructing the XML string yourself, then you will need to do the replacement yourself:

>>> from lxml import etree
>>> my_text = '<root attr1="line1\nline2"/>'
>>> my_text_fixed = my_text.replace('\n', '&#10;')
>>> root = etree.fromstring(my_text_fixed)
>>>
>>> root.attrib['attr1']  # without print
'line1\nline2'
>>> print(root.attrib['attr1'])
line1
line2
>>>
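
One caveat with a blanket replace('\n', '&#10;') is that it also rewrites newlines between attributes or between elements, which can make the markup invalid. A rough sketch that only touches newlines inside quoted attribute values is shown here. The helper name escape_attr_newlines and the regular expression are my own assumptions for illustration; the pattern is a heuristic, not an XML tokenizer, so it can also match ="..." sequences that occur in comments, CDATA sections, or text content.

```python
import re

# Heuristic: "=" (plus optional whitespace) followed by a quoted value.
# Group 1 is the "=" part, group 2 is the quoted attribute value.
_ATTR_VALUE = re.compile(r"""(=\s*)("[^"<]*"|'[^'<]*')""")


def escape_attr_newlines(xml_text):
    """Escape line breaks inside attribute values as &#10; (hypothetical helper)."""

    def _escape(match):
        # Treat \r\n and a lone \r as a single line break, as an XML parser would.
        value = match.group(2).replace("\r\n", "\n").replace("\r", "\n")
        return match.group(1) + value.replace("\n", "&#10;")

    return _ATTR_VALUE.sub(_escape, xml_text)


raw = '<root attr1="line1\nline2"/>'
print(escape_attr_newlines(raw))  # <root attr1="line1&#10;line2"/>
```

The returned string can then be passed to etree.fromstring() as in the session above.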
0
Martin Honnen On

The XML specification defines attribute value normalization as part of standard XML parsing. That algorithm converts each unescaped white space character (space, tab, carriage return, line feed) in an attribute value to a space. For a line break to survive normalization, it has to be escaped as a character reference in the input markup, e.g.

from lxml import etree

root = etree.fromstring('<root attr1="line1&#10;line2"/>')

then for

print(root.get('attr1'))

you get

line1
line2
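
To make the rule concrete, here is a rough pure-Python sketch of the normalization step for a CDATA-type attribute (the default when no DTD says otherwise), loosely following section 3.3.3 of the XML 1.0 specification. The function name normalize_cdata_attr is made up for illustration. It handles only character references and the five predefined entities, and it is not how lxml or libxml2 actually implement this.

```python
import re

# The five predefined XML entities; entities declared in a DTD are not handled.
_PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}

# One token per match: hex reference, decimal reference, named reference,
# or any single character (including a newline, hence re.DOTALL).
_TOKEN = re.compile(
    r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([A-Za-z_][\w.\-]*);|(.)", re.DOTALL
)


def normalize_cdata_attr(raw):
    """Approximate attribute-value normalization (illustrative sketch only)."""
    # Line ends are normalized to \n before attribute values are processed.
    raw = raw.replace("\r\n", "\n").replace("\r", "\n")
    out = []
    for hex_ref, dec_ref, name, char in _TOKEN.findall(raw):
        if hex_ref:
            out.append(chr(int(hex_ref, 16)))  # referenced character kept as-is
        elif dec_ref:
            out.append(chr(int(dec_ref)))      # referenced character kept as-is
        elif name:
            out.append(_PREDEFINED[name])      # KeyError for unknown entities
        elif char in " \t\n\r":
            out.append(" ")                    # literal white space becomes a space
        else:
            out.append(char)
    return "".join(out)


print(repr(normalize_cdata_attr("line1\nline2")))     # 'line1 line2'
print(repr(normalize_cdata_attr("line1&#10;line2")))  # 'line1\nline2'
```

The second call shows why the character reference survives: it is expanded to a line feed directly, without passing through the white space rule.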