How to query XML with complex types

I am building a program (Visual Studio 2010, .NET 4, C# console application) to gather specific information from a publicly available government report that is only available as an XML download. Its structure is similar to the following:

<Collections>
  <Collection>
    <Info id="123456" address="Some Place" name="Some Name"/>
    <Items>
      <Item1/>
      <Item2/>
      <Item3 I3="Y"/>
      <Item3A I3A1="N" I3A2="N" I3A3="Y"/>
      <Item3B I3B1="N" I3B2="N"/>
      <Item4/>
    </Items>
  </Collection>
  <Collection>...</Collection>
  <Collection>...</Collection>
</Collections>

The full file has hundreds of blocks and ranges from 50 to 100 MB. I have never worked with XML formatted even remotely like this (it looks awful, right?) and have had a lot of trouble finding useful example queries.

I need to return the id from the element for all nodes that have a "Y" in the elements Item3 through Item3B. It's driving me a little crazy, because it would be easy if they had matching element names and matching attributes, but they are all unique. You can't include a wildcard in an XPath query like /Item3*[Q3*="Y"].

Does anybody have any ideas on how to tackle this? Thanks!

There is 1 answer

Mathias Müller (best answer)

I need to return the id from the element for all nodes that have a "Y" in the elements Item3 through Item3B.

The right answer depends on the exact "rules" for selecting nodes. It's not clear whether you are always looking for Item3 through Item3B or whether they are just examples of the rule. I also assume that by "nodes have a 'Y' in the elements" you mean they have an attribute whose value equals "Y".

If you are interested in exactly three element nodes with exactly the names "Item3", "Item3A" and "Item3B", and if the "Y" value can be on any attribute, use

//*[self::Item3 or self::Item3A or self::Item3B][@* = 'Y']
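
As a rough sketch of how this expression could be used to get at the ids the question asks for, the predicate can be moved onto the Collection element and the path continued to Info/@id. That wrapping path is an assumption based on the sample structure, not something the expression above does on its own. The example is in Java only because its standard library ships an XPath 1.0 engine; in .NET the same expression string can be passed to XmlNode.SelectNodes. The class name, method name and the second Collection in the sample data are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class Item3Ids {
    // Trimmed-down data in the shape shown in the question.
    // The second Collection (id 654321) is invented so that one block does NOT match.
    static final String SAMPLE =
        "<Collections>"
      + "<Collection><Info id=\"123456\"/><Items>"
      + "<Item3 I3=\"N\"/><Item3A I3A1=\"N\" I3A3=\"Y\"/><Item3B I3B1=\"N\"/>"
      + "</Items></Collection>"
      + "<Collection><Info id=\"654321\"/><Items>"
      + "<Item3 I3=\"N\"/><Item4 I4=\"Y\"/>"
      + "</Items></Collection>"
      + "</Collections>";

    // Returns the Info/@id of every Collection that has an Item3, Item3A or Item3B
    // child of Items carrying at least one attribute equal to 'Y'.
    static List<String> idsWithYes(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "//Collection[Items/*[self::Item3 or self::Item3A or self::Item3B]"
                    + "[@* = 'Y']]/Info/@id";
        NodeList hits = (NodeList) xpath.evaluate(
            expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            ids.add(hits.item(i).getNodeValue()); // attribute node -> its value
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(idsWithYes(SAMPLE));
    }
}
```

With this sample only the first Collection qualifies: the 'Y' in the second one sits on Item4, which the self:: tests exclude.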

Else, if the rule only says that element names must start with "Item3", use

//*[starts-with(name(),'Item3')][@* = 'Y']

If your input XML document uses namespaces, it is safer to use the local-name() function instead of name(), because name() returns the qualified name including any prefix.
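
To illustrate the difference between name() and local-name(), here is a small sketch that runs both variants against a prefixed version of the markup. The prefix r and the namespace URI urn:example:report are invented for the demonstration; the question's file may not use namespaces at all. The document is parsed with a namespace-aware parser so that the prefix is treated as a namespace prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LocalNameDemo {
    // Hypothetical prefixed variant of the question's Items block.
    static final String SAMPLE =
        "<r:Items xmlns:r=\"urn:example:report\">"
      + "<r:Item3 I3=\"N\"/><r:Item3A I3A1=\"Y\"/><r:Item4 I4=\"Y\"/>"
      + "</r:Items>";

    // Counts the nodes selected by expr in the given XML string.
    static int count(String xml, String expr) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for prefixes to be resolved
        Document doc = factory.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Double n = (Double) xpath.evaluate(
            "count(" + expr + ")", doc, XPathConstants.NUMBER);
        return n.intValue();
    }

    public static void main(String[] args) throws Exception {
        // name() sees "r:Item3A", which does not start with "Item3".
        System.out.println(count(SAMPLE, "//*[starts-with(name(),'Item3')][@* = 'Y']"));
        // local-name() sees "Item3A", so the prefix no longer matters.
        System.out.println(count(SAMPLE, "//*[starts-with(local-name(),'Item3')][@* = 'Y']"));
    }
}
```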

It seems you are also trying to match attributes whose names start with a certain string:

//*[starts-with(name(),'Item3')][@*[starts-with(name(),'Q3')] = 'Y']
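
A minimal sketch of the attribute-name filter follows. Note that the attributes in the question's sample start with I3 rather than Q3, so the sketch uses 'I3' as the prefix; that substitution is an assumption. The note attribute is invented to show that a 'Y' on an attribute with a different name is ignored.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AttrPrefixDemo {
    // Item3A has a 'Y', but only on the made-up "note" attribute.
    static final String SAMPLE =
        "<Items>"
      + "<Item3 I3=\"N\"/>"
      + "<Item3A I3A1=\"N\" note=\"Y\"/>"
      + "<Item3B I3B1=\"N\" I3B2=\"Y\"/>"
      + "</Items>";

    // Returns the names of elements starting with "Item3" that have an
    // attribute whose name starts with "I3" and whose value is 'Y'.
    static List<String> matchingElements(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "//*[starts-with(name(),'Item3')]"
                    + "[@*[starts-with(name(),'I3')] = 'Y']";
        NodeList hits = (NodeList) xpath.evaluate(
            expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            names.add(hits.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(matchingElements(SAMPLE));
    }
}
```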

As you can see,

You can't include a wildcard in an XPath query like /Item3*[Q3*="Y"].

is not really true. XPath does offer this kind of matching (it is not usually called a wildcard), but you need the right syntax.