Is there any Python XML parser that was designed with humans in mind?


I like Python, but I don't want to write 10 lines just to get an attribute from an element. Maybe it's just me, but minidom isn't that mini. The code I have to write in order to parse something using it looks a lot like Java code.

Is there something more user-friendly? Something with overloaded operators that maps elements to objects?

I'd like to be able to access this:


<root>
<node value="30">text</node>
</root>

as something like this:


obj = parse(xml_string)
print obj.node.value

and not using getChildren or some other methods like that.


There are 4 answers

Etienne On BEST ANSWER

You should take a look at ElementTree. It doesn't do exactly what you want, but it's a lot better than minidom. It has been included in the standard library since Python 2.5 (as xml.etree.ElementTree). For more speed, use cElementTree. For even more speed (and more features), you can use lxml (check its objectify API, which is close to the approach you describe).

I should add that BeautifulSoup does part of what you want. Amara also takes this approach.
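
For comparison with minidom, here is a minimal sketch of reading the attribute and text from the question's sample XML with the standard-library ElementTree. It is only an illustration of the basic calls (fromstring, find, get), not the attribute-style access the question asks for.

```python
import xml.etree.ElementTree as ET

xml_string = """<root>
<node value="30">text</node>
</root>"""

root = ET.fromstring(xml_string)  # the <root> element
node = root.find("node")          # first direct child named "node"

value = node.get("value")         # attribute lookup; always returns a string
text = node.text                  # text content of <node>

print(value)
print(text)
```

Attribute values come back as strings, so convert them yourself (for example with int(value)) if you need another type.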

stuxnetting On

I spent a fair bit of time going through the examples provided above and through the repositories listed on pip.

The easiest (and most Pythonic) way of parsing XML that I have found so far is xmltodict: https://github.com/martinblech/xmltodict

The example below is copied from the documentation on GitHub. It has made life very simple for me many times.

>>> import xmltodict
>>> doc = xmltodict.parse("""
... <mydocument has="an attribute">
...   <and>
...     <many>elements</many>
...     <many>more elements</many>
...   </and>
...   <plus a="complex">
...     element as well
...   </plus>
... </mydocument>
... """)
>>>
>>> doc['mydocument']['@has']
u'an attribute'
>>> doc['mydocument']['and']['many']
[u'elements', u'more elements']
>>> doc['mydocument']['plus']['@a']
u'complex'
>>> doc['mydocument']['plus']['#text']
u'element as well'

It works really well and gave me just what I was looking for. However, if you need the reverse transformation (from the dictionary back to XML), that is a different matter.

Dobrosław Żybort On

I had the same need for a simple XML parser, and after a long time spent checking different libraries I found xmltramp.

Based on your example xml:

import xmltramp

xml_string = """<root>
<node value="30">text</node>
</root>"""

obj = xmltramp.parse(xml_string)
print obj.node('value')             # 30
print str(obj.node)                 # text

I haven't found anything more user-friendly.

steveha On

I actually wrote a library that does things exactly the way you imagined it. The library is called "xe" and you can get it from: http://home.avvanta.com/~steveha/xe.html

xe can import XML to let you work with the data in an object-oriented way. It actually uses xml.dom.minidom to do the parsing, but then it walks over the resulting tree and packs the data into xe objects.
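
To illustrate the general idea of parsing with minidom and then packing the tree into plain objects, here is a rough sketch. It is not xe's actual code: the Element class and the parse function are hypothetical names made up for this example, and it only handles the simple case where each child tag appears once.

```python
from xml.dom import minidom


class Element(object):
    """Hypothetical wrapper: child elements become Python attributes,
    XML attributes are collected in the .attrs dict."""

    def __init__(self, dom_node):
        # Copy the XML attributes into an ordinary dict of strings.
        self.attrs = dict(dom_node.attributes.items())
        self.text = ""
        for child in dom_node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                # Expose the child under its tag name; a repeated tag
                # overwrites the earlier one in this simplified sketch.
                setattr(self, child.tagName, Element(child))
            elif child.nodeType == child.TEXT_NODE:
                self.text += child.data
        self.text = self.text.strip()


def parse(xml_string):
    # Let minidom do the parsing, then walk the tree from the root element.
    return Element(minidom.parseString(xml_string).documentElement)


obj = parse("""<root>
<node value="30">text</node>
</root>""")
print(obj.node.attrs["value"])
print(obj.node.text)
```

A child element named "attrs" or "text" would clash with the wrapper's own fields, which is one reason a real library needs more care than this.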

EDIT: Okay, I went ahead and implemented your example in xe, so you can see how it works. Here are classes to implement the XML you showed:

import xe

class Node(xe.TextElement):
    def __init__(self, text="", value=None):
        xe.TextElement.__init__(self, "node", text)
        if value is not None:
            self.attrs["value"] = value

class Root(xe.NestElement):
    def __init__(self):
        xe.NestElement.__init__(self, "root")
        self.node = Node()

And here is an example of using the above. I put your sample XML into a file called "example.xml", but you could also just put it into a string and pass the string.

>>> root = Root()
>>> print root
<root/>
>>> root.import_xml("example.xml")
<Root object at 0xb7e0c52c>
>>> print root
<root>
    <node value="30">text</node>
</root>
>>> print root.node.attrs["value"]
30
>>>

Note that in this example, the type of "value" will be a string. If you really need attributes of another type, that's possible too with a little bit of work, but I didn't bother for this example. (If you look at PyFeed, there is a class for OPML that has an attribute that isn't text.)