
Read XML processing instructions with attributes


I have an XML file

<?xml version="1.0" encoding="UTF-8"?>
<?foo class="abc" options="bar,baz"?>
<document>
 ...
</document>

and I'm interested in the processing instruction foo and its attributes.

I can use ET.iterparse for reading the PI, but it escapes me how to access the attributes as a dictionary – .attrib only gives an empty dict.

import xml.etree.ElementTree as ET

for _, elem in ET.iterparse("data.xml", events=("pi",)):
    print(repr(elem.tag))
    print(repr(elem.text))
    print(elem.attrib)

This prints:

<function ProcessingInstruction at 0x7f848f2f7ba0>
'foo class="abc" options="bar,baz"'
{}

Any hints?

3

There are 3 answers

0
Michael Kay

While the contents of the PI look rather like attributes, this is just a convention that the author of this document has adopted; it's not something defined by the XML spec, and therefore it's not something supported in data models like DOM and XDM. Such name="value" pairs are sometimes called "pseudo-attributes".

You'll either have to parse them yourself by hand, or find a library that does it for you. Saxon has an XPath extension function saxon:get-pseudo-attribute(); other libraries may have something similar.
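
Parsing by hand can be sketched in Python with a regular expression. This is only a minimal illustration: the function name parse_pseudo_attributes is made up for this example, and the sketch assumes the PI data consists of name="value" or name='value' pairs. It does not resolve entity or character references inside the values.

```python
import re

# Matches name="value" or name='value' pairs (the pseudo-attribute convention).
_PSEUDO_ATTR = re.compile(r"""([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def parse_pseudo_attributes(pi_data):
    """Hypothetical helper: return the pseudo-attributes in a PI's data as a dict."""
    attrs = {}
    for name, double_quoted, single_quoted in _PSEUDO_ATTR.findall(pi_data):
        # Only one of the two quoted groups can match; the other is "".
        attrs[name] = double_quoted or single_quoted
    return attrs


print(parse_pseudo_attributes('class="abc" options="bar,baz"'))
```

Text that does not fit the name="value" pattern is silently ignored here, so a PI with free-form content simply yields an empty dict.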

0
LMC

Using the Python lxml module, read the PI content, build an element string from it, and parse that:

>>> from lxml import etree
>>> tree = etree.parse("tmp.xml")
>>> pi = tree.xpath('//processing-instruction("foo")')
>>> pi[0].text
'class="abc" options="bar,baz"'
>>> root = etree.fromstring(f"<root {pi[0].text}/>")
>>> root.get('options')
'bar,baz'

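Note: ElementTree drops processing instructions when it builds a tree by default; it only reports them through the "pi" event of iterparse (Python 3.8+).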

2
Nico Schlömer

The string content of a processing instruction can theoretically be anything. In many cases, though, it looks like the inside of an XML element tag: a name followed by attributes. To parse it, one can build an element string from it and parse that, e.g.:

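import xml.etree.ElementTree as ET

for _, elem in ET.iterparse("data.xml", events=("pi",)):
    # elem.text holds the PI target followed by its content,
    # e.g. 'foo class="abc" options="bar,baz"'
    _elem = ET.fromstring(f"<{elem.text}/>")
    print(_elem.tag)     # the PI target
    print(_elem.attrib)  # the pseudo-attributes as a dict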
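
Since the content can be anything, re-parsing it may fail. A slightly more defensive version could look like the sketch below. The helper name read_pi_attributes is hypothetical, the XML is read from an in-memory buffer so the example runs on its own, and it assumes Python 3.8+ (where iterparse supports the "pi" event).

```python
import io
import xml.etree.ElementTree as ET


def read_pi_attributes(source, target):
    """Hypothetical helper: pseudo-attributes of the first PI named `target`.

    Returns None if there is no such PI or its data is not attribute-like.
    """
    for _, elem in ET.iterparse(source, events=("pi",)):
        # ElementTree stores a PI as "<target> <data>" in elem.text.
        name, _sep, data = elem.text.partition(" ")
        if name != target:
            continue
        try:
            # Re-parse the data as the attribute list of a throwaway element.
            return dict(ET.fromstring(f"<pi {data}/>").attrib)
        except ET.ParseError:
            # The data does not follow the name="value" convention.
            return None
    return None


xml_data = b"""<?xml version="1.0" encoding="UTF-8"?>
<?foo class="abc" options="bar,baz"?>
<document/>
"""
attrs = read_pi_attributes(io.BytesIO(xml_data), "foo")
print(attrs)
```

Using a fixed element name for the throwaway element keeps the PI target out of the re-parsed string, so only the data part has to be well-formed.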