lxml does not parse the text after a comment

I want to wrap the content of tag.text in CDATA:

<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>

But when I read tag.text and the element contains comments, I get only the text before the first comment:

from lxml import etree

parser = etree.XMLParser()
#parser = etree.XMLParser(remove_comments=True)
tree = etree.parse("./data.xml", parser)
root = tree.getroot()

for tag in root.findall("tag"):
    tag.text = etree.CDATA(tag.text)

tree.write("./result.xml",
           encoding = "utf-8",
           xml_declaration = True)

And I get this (tag.text contains only the first "some data"):

<?xml version='1.0' encoding='UTF-8'?>
<root>
  <tag><![CDATA[
    some data
    ]]><!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>

How can I fix this?

There are 5 answers

2
James On

If you want to concatenate all of the text within the <tag> elements, you can call str.join on the result of the element's itertext() method. This joins all of the text, including whitespace, before passing it to etree.CDATA.

for tag in root.findall("tag"):
    tag.text = etree.CDATA(''.join(tag.itertext()))

The comments are treated as child nodes of the <tag> element in your example, and the text that follows each comment is stored as that comment's tail. The itertext() method iterates over this tail text as well.
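
One thing to check with the join approach: as far as I can tell, itertext() yields the tails but not the comment bodies, and assigning tag.text does not remove existing child nodes. The following is a minimal sketch under those assumptions, using an inline sample document (the names XML, joined and result are only illustrative). It drops the comments from the output entirely.

```python
from lxml import etree

XML = b"""<root>
  <tag>
    some data
    <!-- some data2 -->
    some data
  </tag>
</root>"""

root = etree.fromstring(XML)
tag = root.find("tag")

# tag.text stops at the first child node; here that child is a comment.
print(repr(tag.text))
for child in tag:
    # Comments report etree.Comment as their tag; the text after them is their tail.
    print(child.tag is etree.Comment, repr(child.text), repr(child.tail))

# itertext() yields tag.text and the tails, but not the comment bodies.
joined = "".join(tag.itertext())

# Assigning tag.text does not remove children, so remove them explicitly;
# otherwise the comments and their tail text would follow the CDATA section.
for child in list(tag):
    tag.remove(child)
tag.text = etree.CDATA(joined)

result = etree.tostring(root, encoding="unicode")
print(result)
```

If the comments should survive as literal text inside the CDATA section, the children have to be serialized instead of skipped.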

0
Anaph On

I found a tricky way to handle the text, comments, and tails together:

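tmp = etree.tostring(tag, encoding="unicode", with_tail=False)
# strip the outer <tag ...> and </tag>, keeping only the inner markup
# (assumes the element is not self-closing, i.e. not <tag/>)
inner = tmp[tmp.index(">") + 1:tmp.rindex("</")]
tail = tag.tail  # clear() also resets the tail (and attributes), so save it
tag.clear()
tag.text = etree.CDATA(inner)
tag.tail = tail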

If someone knows a cleaner or more correct way to do this (for example, something like tag.all), please post it.

4
Martin Honnen On

Consider using SaxonC HE (the saxonche package) and XSLT 3.0:

from saxonche import PySaxonProcessor

with PySaxonProcessor(license=False) as saxon_proc:
    xslt30_processor = saxon_proc.new_xslt30_processor()

    xslt30_processor.transform_to_file(source_file='sample1.xml', stylesheet_file='serialize-wrap-in-cdata1.xsl', output_file='result-sample1.xml')

The XSLT 3.0 stylesheet is, for example:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    expand-text="yes"
    version="3.0">

  <xsl:param name="cdata-tag-names" as="xs:string*" static="yes" select="'tag'"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" _cdata-section-elements="{$cdata-tag-names}"/>

  <xsl:template _match="{$cdata-tag-names => string-join(' | ')}">
    <xsl:copy>{serialize(node())}</xsl:copy>
  </xsl:template>

</xsl:stylesheet>

sample1.xml is your input:

<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>

Public Gist with the files: https://gist.github.com/martin-honnen/61b91233fd73369d55f392ad4a0cee0b.

Example fiddle using SaxonC HE is at this link.

0
LMC On

Iterate over the tag element to collect its text, the text representation of each comment node (without tail text), and any tail text (which includes the indentation). Then remove each child and set the tag element's text to the CDATA-wrapped string.

from lxml import etree

parser = etree.XMLParser()
tree = etree.parse("tmp.xml", parser)
root = tree.getroot()

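for s in root.findall("tag"):
    t = s.text or ""  # text before the first child; may be None
    for ele in list(s):  # copy the child list because children are removed in the loop
        # serialize the child (here: a comment) without its tail
        t += etree.tostring(ele, with_tail=False).decode("utf8")
        t += ele.tail or ""  # text after the child; may be None
        # remove the child; lxml removes its tail along with it
        s.remove(ele)
    s.text = etree.CDATA(t)
    #print(etree.tostring(s).decode("utf8"))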

print(etree.tostring(tree, with_tail=True).decode("utf8"))

Result

<root>
  <tag><![CDATA[
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  ]]></tag>
</root>
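
The same steps can be packaged as a small helper. This is only a sketch: wrap_inner_xml_in_cdata is a hypothetical name, not part of lxml, and the inline document is a shortened stand-in for the sample file. Re-parsing the output shows that the former comment is now plain text. Content containing "]]>" cannot go into a single CDATA section, so this sketch does not cover that case.

```python
from lxml import etree


def wrap_inner_xml_in_cdata(elem):
    """Hypothetical helper: replace elem's content with one CDATA section."""
    parts = [elem.text or ""]
    for child in list(elem):
        # Serialize the child node (comment or element) without its tail...
        parts.append(etree.tostring(child, encoding="unicode", with_tail=False))
        # ...then add the tail, which may be None.
        parts.append(child.tail or "")
        elem.remove(child)
    elem.text = etree.CDATA("".join(parts))


root = etree.fromstring(b"<root><tag>a<!-- c -->b</tag></root>")
for tag in root.findall("tag"):
    wrap_inner_xml_in_cdata(tag)

out = etree.tostring(root, encoding="unicode")
print(out)

# Re-parse the output: the former comment is now literal text inside <tag>.
reparsed = etree.fromstring(out)
print(repr(reparsed.find("tag").text))
```
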
0
Hermann12 On

xml.etree.ElementTree provides ET.iterparse(), which reports parse events, including comments (the 'comment' event is available from Python 3.8):

import xml.etree.ElementTree as ET
from io import StringIO

xml_file = """<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data 1
    <!-- some data2 -->
    <!-- some data3 -->
    some data 4
  </tag>
</root>
"""

f = StringIO(xml_file)

for event, elem in ET.iterparse(f, events=('start','comment')):
    if event == 'start' and elem.tag == 'tag':
        print('Text start', elem.text)
    # test the event name instead of matching on repr(elem.tag)
    if event == 'comment':
        print("Comment", elem.text)

Output:

Text start 
    some data 1
    
    
    some data 4
  
Comment  some data2 
Comment  some data3 
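
The "Text start" output above contains the text on both sides of the comments because the standard-library parser does not insert comments into the tree by default. As a sketch for comparison (Python 3.8+, with a shortened inline sample), a TreeBuilder created with insert_comments=True keeps them as child nodes, which gives the same text/tail split that lxml shows:

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <tag>
    some data 1
    <!-- some data2 -->
    some data 4
  </tag>
</root>"""

# Default parser: comments are dropped, so the surrounding text merges into tag.text.
plain_tag = ET.fromstring(XML).find("tag")

# TreeBuilder(insert_comments=True) keeps comments as child nodes of <tag>.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept_tag = ET.fromstring(XML, parser=parser).find("tag")

print(repr(plain_tag.text))
print(len(kept_tag), kept_tag[0].tag is ET.Comment)
print(repr(kept_tag.text), repr(kept_tag[0].tail))
```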

And here is the lxml adaptation:

from lxml import etree
from io import BytesIO

xml_file = """<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data 1
    <!-- some data2 -->
    <!-- some data3 -->
    some data 4
  </tag>
</root>
"""

f = BytesIO(xml_file.encode('utf-8'))

def cdata(text):
    # join the collected pieces and wrap them in a single CDATA section
    tex = ' '.join(text)
    root = etree.Element('root')
    tag = etree.SubElement(root, 'tag')
    tag.text = etree.CDATA(tex)
    etree.dump(root)


tex = []
for event, elem in etree.iterparse(f, events=('start', 'comment')):
    if event == 'start' and elem.tag == 'tag':
        tex.append((elem.text or '').strip())

    # test the event name instead of matching on repr(elem.tag)
    if event == 'comment':
        # rebuild the comment markup, then add the text that follows it
        tex.append(f"<!--{elem.text}-->")
        tex.append((elem.tail or '').strip())

cdata(tex)

Output:

<root>
  <tag><![CDATA[some data 1 <!-- some data2 -->  <!-- some data3 --> some data 4]]></tag>
</root>