Trouble reading XML with ElementTree due to xmlns and xsi

I'm reading an XML file with Python and ElementTree and am struggling with the xmlns and xsi attributes.

The top of my XML looks like this.

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="website"?>
<SurveyGroup xmlns:xsi="website2" xmlns:xsd="website3" xsi:schemaLocation="website4 website5" xmlns="website6">
<Survey>
<Header>

I follow the usual ElementTree process:

import xml.etree.ElementTree as ET

tree = ET.parse(xmlfile)

root = tree.getroot()

The issue is that the xmlns or xsi data appears to be interfering with this. I'm unable to access the elements as children of this root, and if I print root I get <Element '{website}SurveyGroup' at 0x00000278FCC85120>.

If I change that line to just <SurveyGroup>, I don't get this issue.

There are 2 answers

larsks On

Every element in your XML document belongs to a namespace. Names with a prefix (like the attribute xsi:schemaLocation) are in the namespace bound to that prefix; elements without a prefix are in the default namespace website6, set by the xmlns="website6" declaration. That is why ElementTree reports the root tag in the form {namespace-URI}SurveyGroup rather than plain SurveyGroup.

If you want to look up elements in that document, you need to specify the appropriate namespace. There are a couple of ways of doing this. You can include the namespace literally in curly brackets, like this:

>>> doc.findall('{website6}Survey')
[<Element '{website6}Survey' at 0x7f02b45699e0>]

You can also refer to namespaces via a namespace prefix:

>>> namespaces={'foo': 'website6'}
>>> doc.findall('foo:Survey', namespaces=namespaces)
[<Element '{website6}Survey' at 0x7f02b45699e0>]

Here, we map the prefix foo to the website6 namespace, so we can use the foo: prefix on element names.
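The same prefix map also works for multi-step paths, as long as every step carries the prefix. The following is a minimal sketch, not the asker's real data: the XML string is a hypothetical stand-in built from the header in the question (the Header text and closing tags are assumed), and the prefix s is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the asker's file; the URIs are the
# placeholders from the question, the body is assumed.
XML = """<SurveyGroup xmlns:xsi="website2" xmlns:xsd="website3"
             xsi:schemaLocation="website4 website5" xmlns="website6">
  <Survey>
    <Header>example</Header>
  </Survey>
</SurveyGroup>"""

root = ET.fromstring(XML)

# The prefix on the left is local to the query; only the URI on the
# right has to match the document.
ns = {"s": "website6"}

# Every step of the path needs the prefix, not just the first one.
headers = root.findall("s:Survey/s:Header", ns)
header_texts = [h.text for h in headers]
print(header_texts)
```

The prefix used in the query does not have to match anything in the file, which is handy here because the elements in the document have no prefix at all.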


You can set a default namespace in your queries by adding an entry to your namespaces dictionary with an empty key:

>>> namespaces={'': 'website6'}
>>> doc.findall('Survey', namespaces=namespaces)
[<Element '{website6}Survey' at 0x7f02b45699e0>]
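With the empty key in place, every unprefixed step in a path is looked up in that namespace, so nested paths read just like they would for a document without namespaces. A small sketch under the same assumptions as before (a made-up document using the placeholder URI website6); note that the empty-key form appears to require Python 3.8 or newer, so treat that as an assumption about your environment.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down document using the question's placeholder URI.
XML = """<SurveyGroup xmlns="website6">
  <Survey>
    <Header>example</Header>
  </Survey>
</SurveyGroup>"""

root = ET.fromstring(XML)

# An empty key makes "website6" the namespace for unprefixed names
# in the path expression.
default_ns = {"": "website6"}

header = root.find("Survey/Header", default_ns)
header_text = header.text if header is not None else None
print(header_text)
```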
Michael Kay On

The attribute

xsi:schemaLocation="website4 website5"

is a conventional way of saying that the schema for namespace website4 is to be found at location website5.

But that convention only applies if the prefix xsi is bound to the namespace http://www.w3.org/2001/XMLSchema-instance. If, as here, it is bound to a different namespace, such as website2, then it loses this special meaning, and it's just an ordinary attribute that means nothing to the XML parser or validator.
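ElementTree makes this binding visible: a prefixed attribute is stored under the URI its prefix is bound to, not under the prefix itself. A short sketch, assuming the placeholder URIs from the question are taken literally (the self-closing root element is a simplification):

```python
import xml.etree.ElementTree as ET

# Hypothetical root element only, using the placeholder URIs from the post.
XML = ('<SurveyGroup xmlns:xsi="website2" '
       'xsi:schemaLocation="website4 website5" xmlns="website6"/>')

root = ET.fromstring(XML)

# xmlns declarations are not reported as attributes; only
# xsi:schemaLocation remains, keyed by the URI bound to "xsi".
attr_names = list(root.attrib)
print(attr_names)

W3C_XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Found under the placeholder URI ...
placeholder_value = root.get("{website2}schemaLocation")
# ... but not under the standard XMLSchema-instance URI.
standard_value = root.get("{%s}schemaLocation" % W3C_XSI)
print(placeholder_value, standard_value)
```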

Which means I'm puzzled as to why it should cause a problem. Perhaps the actual namespace in your source is different, and you changed it for your post? And what error are you actually getting? You say "I'm not able to..." but you don't say how it fails.

Probably the reason you are unable to access the children of the root is that you are looking for no-namespace elements, not for elements in namespace website6.
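That failure mode is easy to reproduce. In the sketch below (a hypothetical document using the placeholder URI website6), the unqualified lookup silently returns None while the qualified one succeeds. The helper namespace_of is an illustrative name, not part of ElementTree; it just reads the URI back out of the root tag so it does not have to be hard-coded.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using the question's placeholder URI.
XML = """<SurveyGroup xmlns="website6">
  <Survey>
    <Header>example</Header>
  </Survey>
</SurveyGroup>"""

root = ET.fromstring(XML)

# Looks for a no-namespace element called Survey: no match, no error.
unqualified = root.find("Survey")

# Looks for Survey in namespace website6: match.
qualified = root.find("{website6}Survey")


def namespace_of(element):
    """Return the namespace URI from a '{uri}local' tag, or '' if none."""
    tag = element.tag
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return ""


ns_uri = namespace_of(root)
via_derived = root.find("{%s}Survey" % ns_uri)
print(unqualified, qualified.tag, ns_uri)
```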