I have been using lxml and objectify to read XML files so far:
- I created custom classes with the same name as tags
- I used for loops to read the objectified xml-tags and mapped them into my custom classes
The above technique worked as I didn't need to preserve the sequence of tags.
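A minimal sketch of that kind of mapping is below. It is only an illustration: it uses the standard library's xml.etree.ElementTree instead of objectify so that it runs without extra packages, and the `Formula` class, the `read_formulas` helper and the sample XML are hypothetical names, not my real code.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Formula:
    """Hypothetical custom class named after the <formula> tag."""
    id: str
    text: str


def read_formulas(xml_text):
    """Map every <formula> element onto a Formula instance."""
    root = ET.fromstring(xml_text)
    # Only the elements themselves are collected; any text that sits
    # between them in the parent is ignored by this approach.
    return [Formula(id=el.get("id"), text=el.text or "")
            for el in root.iter("formula")]


sample = '<member><formula id="39">my</formula><formula id="102">is</formula></member>'
formulas = read_formulas(sample)
print(formulas)
```

This is fine as long as each tag can be handled in isolation, which is no longer the case once text and tags are interleaved.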
However, I now have a new challenge.
Consider the following XML file:
<member>
  <detaileddescription>
    Hello
    <formula id="39">my</formula>
    name
    <formula id="102">is</formula>
    Buddy.
    <formula id="103">I</formula>
    am a
    <itemizedlist>
      <listitem>
        superhero
        <formula id="104">.</formula>
      </listitem>
      <listitem>
        At least,
        <formula id="105">I think</formula>
      </listitem>
    </itemizedlist>
    so...:)
    <simplesect kind="see">
      What
      <ref refid="ref_id" kindref="ref_kindref">do you</ref>
      <bold>think</bold> ?
    </simplesect>
    Let me know.
  </detaileddescription>
</member>
My task is to read it while preserving the order of the text that appears between the tags.
I have experimented a lot, but I haven't found a way to do this. My current attempt:
from lxml import etree, objectify


def to_list(root):
    my_list = []
    # iter() visits the elements in document order (each element before its children).
    for item in root.iter():
        if item.text is not None:
            text = item.text.strip()
            # Compare by value; "is not" tests identity and is unreliable for strings.
            if text != "":
                my_list.append("text####" + text)
        if item.tail is not None:
            tail = item.tail.strip()
            if tail != "":
                my_list.append("tail####" + tail)
    return my_list


if __name__ == '__main__':
    in_file = r"xml.xml"
    class_dom = etree.parse(in_file)
    class_xml_bin = etree.tostring(class_dom, pretty_print=False, encoding="ascii")
    class_xml_text = class_xml_bin.decode()
    root = objectify.fromstring(class_xml_text)
    my_list = to_list(root.detaileddescription)
    for item in my_list:
        print(item)
Output:
text####Hello
text####my
tail####name
text####is
tail####Buddy.
text####I
tail####am a
tail####so...:)
text####superhero
text####.
text####At least,
text####I think
text####What
tail####Let me know.
text####do you
text####think
tail####?
Here you can see that the output doesn't preserve the original sequence. For instance, so...:) is printed before the list items, although in the XML it comes after the closing </itemizedlist> tag.
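As far as I understand, this happens because an element's tail is the text after its closing tag, while my loop emits the tail as soon as the element is visited, before any of its children. A rough sketch of the order I am after is below; it is not a full solution. It assumes the standard library's xml.etree.ElementTree (which uses the same text/tail model as lxml.etree), and the `walk` helper and the shortened sample are made up for illustration.

```python
import xml.etree.ElementTree as ET


def walk(elem):
    """Yield the non-empty text pieces under elem in document order."""
    if elem.text and elem.text.strip():
        yield elem.text.strip()
    for child in elem:
        # Everything inside the child comes first ...
        yield from walk(child)
        # ... and only then the text that follows its closing tag.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


sample = "<d>am a<list><item>superhero</item><item>At least,</item></list>so...:)</d>"
pieces = list(walk(ET.fromstring(sample)))
print(pieces)
```

Even with the order fixed this way, the result is still a flat list of strings rather than instances of my classes.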
Another major problem with this solution is that it doesn't keep the XML content as class instances; it only produces flat text output.
Does anyone have any suggestions?
Note: I must not use XPath.