How can I read an XML file and preserve the order of tags and text so that the meaning stays intact?


I have been using lxml and objectify to read XML files thus far:

  1. I created custom classes with the same names as the tags
  2. I used for loops to read the objectified XML tags and mapped them onto my custom classes

This technique worked because I didn't need to preserve the order of the tags.

However, now, I have a new challenge.

Check the following XML file:

<member>
    <detaileddescription>
        Hello
        <formula id="39">my</formula>
        name
        <formula id="102">is</formula>
        Buddy.
        <formula id="103">I</formula>
        am a
        <itemizedlist>
            <listitem>
            superhero
            <formula id="104">.</formula>
            </listitem>
            <listitem>
                At least,
                <formula id="105">I think</formula>
            </listitem>
        </itemizedlist>
        so...:)
        <simplesect kind="see">
            What
            <ref refid="ref_id" kindref="ref_kindref">do you</ref>
            <bold>think</bold> ?
        </simplesect>
        Let me know.
    </detaileddescription>
 </member>

My task is to read it while preserving its meaning, that is, the order in which the text and the tags are interleaved.

I have experimented a lot, but I haven't found a way that works.

from lxml import etree, objectify


def to_list(root):
    my_list = []
    for item in root.iter():
        # Text that directly follows the element's opening tag
        if item.text is not None:
            text = item.text.strip()
            if text:  # skip whitespace-only text
                my_list.append("text####" + text)

        # Text that follows the element's closing tag
        if item.tail is not None:
            tail = item.tail.strip()
            if tail:  # skip whitespace-only tails
                my_list.append("tail####" + tail)
    return my_list

if __name__ == '__main__':
    in_file = r"xml.xml"

    class_dom = etree.parse(in_file)
    class_xml_bin = etree.tostring(class_dom, pretty_print=False, encoding="ascii")
    class_xml_text = class_xml_bin.decode()
    root = objectify.fromstring(class_xml_text)

    my_list = to_list(root.detaileddescription)

    for item in my_list:
        print(item)

Output:

text####Hello
text####my
tail####name
text####is
tail####Buddy.
text####I
tail####am a
tail####so...:)
text####superhero
text####.
text####At least,
text####I think
text####What
tail####Let me know.
text####do you
text####think
tail####?

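As you can see, the output does not preserve the document order. For instance, so...:) is printed before the list items that precede it in the XML, and Let me know. is printed before the contents of simplesect. This happens because each element's tail is emitted right after its own text, before its children are visited.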
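To make the ordering issue concrete, here is a minimal sketch of a recursive walk that emits an element's text first, then each child in turn, and each child's tail only after that child has been fully visited. It is an illustration under assumptions, not a tested solution for the real file: the function name walk and the shortened sample string are hypothetical, and it uses the standard library's xml.etree.ElementTree, whose elements expose the same .text and .tail attributes as lxml.etree. With objectify elements, children would probably need to be obtained via iterchildren(), since iterating an objectified element directly yields same-tag siblings rather than children.

```python
import xml.etree.ElementTree as ET


def walk(elem):
    """Yield (kind, text) pairs in document order, without XPath."""
    # 1. Text directly after the opening tag comes first.
    if elem.text and elem.text.strip():
        yield ("text", elem.text.strip())
    for child in elem:
        # 2. Everything inside the child comes next ...
        yield from walk(child)
        # 3. ... and only then the text that follows the child's closing tag.
        if child.tail and child.tail.strip():
            yield ("tail", child.tail.strip())


# Hypothetical, shortened stand-in for the XML in the question.
sample = (
    "<d>Hello <f>my</f> name "
    "<l><i>hero</i></l> so...:) <b>think</b> ?</d>"
)

for kind, text in walk(ET.fromstring(sample)):
    print(kind + "####" + text)
```

The only difference from to_list is where the tail is handled: it belongs to the parent's content, so it is emitted by the parent's loop after the recursive call returns.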

Another major problem with my to_list approach is that it doesn't keep the XML content as class instances; it produces only flat text.
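As a rough sketch of what keeping the content as classes could look like, the following builds one object per element whose content list holds strings and child objects in the order they appear. Everything here is an assumption for illustration: the generic Node class, its field names, and the build and flatten helpers are hypothetical and are not the custom per-tag classes mentioned earlier. It uses the standard library's xml.etree.ElementTree, which shares the .text/.tail model of lxml.etree, and no XPath.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Node:
    """Hypothetical container: one instance per XML element."""
    tag: str
    attrib: Dict[str, str] = field(default_factory=dict)
    # Strings and child nodes, interleaved in document order.
    content: List[Union[str, "Node"]] = field(default_factory=list)


def build(elem) -> Node:
    """Convert an element into a Node, keeping mixed content in order."""
    node = Node(elem.tag, dict(elem.attrib))
    if elem.text and elem.text.strip():
        node.content.append(elem.text.strip())
    for child in elem:
        node.content.append(build(child))
        # The tail follows the child inside the parent's content.
        if child.tail and child.tail.strip():
            node.content.append(child.tail.strip())
    return node


def flatten(node: Node) -> str:
    """Join all text of a node and its descendants in reading order."""
    parts = [p if isinstance(p, str) else flatten(p) for p in node.content]
    return " ".join(parts)
```

A dispatch table from tag name to a specific class (for example a dict keyed by elem.tag) could replace the single Node class if one class per tag is required; that mapping is left out here because the actual classes are not shown.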

Does anyone have any suggestions?

Note: I must not use XPath.
