Parse new line character (\n) in the attribute value

Question

65 views Asked by privod At 24 March 2024 at 07:02

I am parsing an XML document with the lxml library. One attribute value contains a newline character (\n):

from lxml import etree

root = etree.fromstring('<root attr1="line1\nline2"/>')
print(etree.tostring(root).decode())

Result:

<root attr1="line1 line2"/>

That is, the parser replaces the newline character with a space. Is there any way to leave the newline character in the attribute value when parsing?

I know that a newline character can be included when creating the XML:

from lxml import etree

root = etree.Element('root', attr1='line1\nline2')
print(root.attrib['attr1'])
print(etree.tostring(root).decode())

Result:

line1
line2
<root attr1="line1&#10;line2"/>

But how can I do this when parsing?

Update

The behaviour seems to depend on the OS. The problem described above occurs on Windows; when I ran my example on Linux, the newline characters appeared to be preserved. Is there a way to disable the conversion of newline characters to spaces on Windows?

Original Q&A

There are 3 answers

**guegouoguiddel** · Answer 1 · 2024-03-24T07:16:22+00:00

I think the below can help:

from lxml import etree, objectify

root = objectify.fromstring('<root attr1="line1\nline2"/>')
attr1_value = root.get('attr1')
print(attr1_value)  # Output: line1\nline2

**aneroid** · Answer 2 · 2024-03-24T11:01:41+00:00

You'll need to handle this manually; see "How to save newlines in XML attribute?"

If it's in the incoming data, it should already be correctly encoded. In your own example where you add the \n, see the output of etree.tostring(root).decode():

root = etree.Element('root', attr1='line1\nline2')
print(etree.tostring(root).decode())

<root attr1="line1&#10;line2"/>

Note the &#10; above, which was substituted for \n automatically.

So if you want the same behaviour but you're constructing the XML string yourself, then you will need to do the replacement yourself:

>>> my_text = '<root attr1="line1\nline2"/>'
>>> my_text_fixed = my_text.replace('\n', '&#10;')
>>> root = etree.fromstring(my_text_fixed)
>>>
>>> root.attrib['attr1']  # without print
'line1\nline2'
>>> print(root.attrib['attr1'])
line1
line2
>>>
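One caveat with a blanket replace('\n', '&#10;') is that it also rewrites newlines that sit between attributes inside a tag, where a character reference is not allowed. If you are building the markup yourself, a narrower option is to escape only the attribute value. The sketch below is an assumption-laden illustration, not part of the original answer: it uses the standard library's xml.sax.saxutils.quoteattr and xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

value = 'line1\nline2'

# quoteattr() adds the surrounding quotes and escapes \n, \r and \t
# as character references, so only the attribute value is touched.
xml_text = '<root attr1={}/>'.format(quoteattr(value))
print(xml_text)  # <root attr1="line1&#10;line2"/>

# Parsing the escaped markup gives back the original value, newline included.
root = ET.fromstring(xml_text)
print(repr(root.get('attr1')))
```

The same escaped string can be passed to etree.fromstring from lxml, as in the session above.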

**Martin Honnen** · Answer 3 · 2024-03-24T14:52:38+00:00

The XML specification defines attribute value normalization as part of standard XML parsing. That algorithm converts unescaped whitespace characters, such as literal line breaks, to spaces. To ensure that line breaks survive attribute value normalization, they have to be escaped as character references in the input markup, e.g.

root = etree.fromstring('<root attr1="line1&#10;line2"/>')

then for

print(root.get('attr1'))

you get

line1
line2
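To see the two cases side by side, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree (backed by expat) rather than lxml, purely so it runs without extra dependencies; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Literal newline inside the attribute value: normalized to a space.
literal = ET.fromstring('<root attr1="line1\nline2"/>')

# Newline written as a character reference: survives normalization.
escaped = ET.fromstring('<root attr1="line1&#10;line2"/>')

print(repr(literal.get('attr1')))  # 'line1 line2'
print(repr(escaped.get('attr1')))  # 'line1\nline2'
```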
