Should I Use Regex to Find the XML Namespace Definition?


The script below works. It parses an XML document and looks up a particular node under the namespace "dei".

But is relying on a regex to find the namespace definition the proper way? (I do not really know XML, so I worry that such a regex is not foolproof for all EDGAR XML files. For example, are such definitions always enclosed in double quotes and preceded by xmlns: ?)

Thanks.

use strict;
use warnings;

use LWP::Simple;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $url = 'https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664/acef-20161231.xml';
my $xml = LWP::Simple::get($url);
my $dom = XML::LibXML->load_xml(string => $xml);

my @nsDefs = ($xml =~ /xmlns:dei="(.+?)"/g);
die "Namespace definition must be unique!\n" unless @nsDefs == 1;

my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs('dei', $nsDefs[0]);

my @matches = $xpc->findnodes('//dei:TradingSymbol');
print 'Number of matches = ', scalar(@matches), "\n";

Output:

Number of matches = 1

There are 6 answers

5
Michael Kay On

Never use regular expressions to process XML: your code will always be wrong. Your example has at least five bugs: it will fail to match if a different prefix is used, it will fail to match if single quotes are used, it will fail to match if there is whitespace around the "=" sign, it will error if the namespace declaration is duplicated, and it will give a spurious match if there is "commented out" XML in the source document.

It is theoretically impossible to eliminate these bugs, because regular expressions are not powerful enough to parse XML correctly.

Always use a real XML parser, and XPath.
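
A minimal sketch of some of the failure modes listed above. It uses a made-up inline document rather than a real EDGAR filing: the `x` prefix, the `http://example.com/stale` URI and the `XYZ` value are assumptions for illustration only; the dei namespace URI is the one from the question's document.

```perl
use strict;
use warnings;

use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical sample: single quotes, whitespace around "=", a prefix other
# than "dei", and a commented-out declaration that the regex will pick up.
my $xml = <<'XML';
<?xml version="1.0"?>
<!-- <old xmlns:dei="http://example.com/stale"/> -->
<root xmlns:x = 'http://xbrl.sec.gov/dei/2014-01-31'>
  <x:TradingSymbol>XYZ</x:TradingSymbol>
</root>
XML

# The regex from the question sees only the commented-out declaration.
my @regex_hits = ($xml =~ /xmlns:dei="(.+?)"/g);
print "Regex hits: @regex_hits\n";

# A parser plus an XPath keyed on the namespace URI finds the real element.
my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dei => 'http://xbrl.sec.gov/dei/2014-01-31');

my @matches = $xpc->findnodes('//dei:TradingSymbol');
print 'Parser matches = ', scalar(@matches), "\n";
```

Here the regex returns the stale URI from the comment, while the parser-based lookup finds the one real element regardless of quoting, spacing or prefix.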

0
ikegami On

dei is not a namespace; it's a prefix that's only meaningful in that particular document. You can't count on the namespace's prefix always being dei.

http://xbrl.sec.gov/dei/2014-01-31 is the namespace. That's the thing that can't change, and that you should be basing your code around.

In a comment, you mentioned you have to deal with multiple specs. Just create an XPath prefix for each spec you support.

use strict;
use warnings;

use LWP::Simple               qw( );
use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

my $url = 'https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664/acef-20161231.xml';

my $xml = LWP::Simple::get($url);
defined $xml or die "Failed to download $url\n";   # get() returns undef on failure

my $doc = XML::LibXML->load_xml(string => $xml);

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( d1 => 'http://xbrl.sec.gov/dei/2012-01-31' );
$xpc->registerNs( d2 => 'http://xbrl.sec.gov/dei/2014-01-31' );

my @matches = $xpc->findnodes('//d1:TradingSymbol|//d2:TradingSymbol', $doc);
print "Number of matches = ", 0+@matches, "\n";
3
Grant McLean On

The only important thing about a namespace in XML is the URI. Your code is assuming a namespace prefix of dei, using that to locate the namespace declaration and determine that the URI is http://xbrl.sec.gov/dei/2014-01-31. This is exactly backwards. The thing you should be hard-coding in your script is the URI - it won't change. The namespace prefix is theoretically variable and a different prefix might be used for the same URI in other documents.
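
To illustrate the point, here is a small sketch with two hypothetical documents that bind different prefixes to the same URI. The documents, the `foo` prefix and the `AAA`/`BBB` values are made up; only the URI comes from the question. The prefix `d` registered in the script is local to the script and need not match either document.

```perl
use strict;
use warnings;

use XML::LibXML;
use XML::LibXML::XPathContext;

# Hard-code the URI; the prefix registered below is local to this script.
my $DEI_URI = 'http://xbrl.sec.gov/dei/2014-01-31';

# Two hypothetical documents that bind different prefixes to the same URI.
my @docs = (
    qq{<r xmlns:dei="$DEI_URI"><dei:TradingSymbol>AAA</dei:TradingSymbol></r>},
    qq{<r xmlns:foo="$DEI_URI"><foo:TradingSymbol>BBB</foo:TradingSymbol></r>},
);

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs(d => $DEI_URI);

my @symbols;
for my $xml (@docs) {
    my $doc = XML::LibXML->load_xml(string => $xml);
    # Pass the document as the context node for this query.
    push @symbols, map { $_->textContent } $xpc->findnodes('//d:TradingSymbol', $doc);
}
print "Symbols: @symbols\n";
```

Both elements are found with the same XPath expression, because matching is done on the URI and not on the prefix text.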

0
Miller On

Use getNamespaces():

my @ns_dei = grep { $_->name eq 'xmlns:dei' } $dom->documentElement()->getNamespaces();

die "Namespace definition must be unique!\n" if @ns_dei != 1;

my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs( 'dei', $ns_dei[0]->value );
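
One caveat, as far as I understand the XML::LibXML documentation: getNamespaces() reports only the declarations written on that particular element, so it relies on the dei declaration sitting on the root element. A hedged sketch of an alternative, lookupNamespaceURI(), which resolves a prefix as seen from a node; the inline document and the `XYZ` value are hypothetical. Note that both approaches still depend on the document using the dei prefix.

```perl
use strict;
use warnings;

use XML::LibXML;

# Hypothetical sample document; only the URI is taken from the question.
my $xml = <<'XML';
<root xmlns:dei="http://xbrl.sec.gov/dei/2014-01-31">
  <dei:TradingSymbol>XYZ</dei:TradingSymbol>
</root>
XML

my $dom  = XML::LibXML->load_xml(string => $xml);
my $root = $dom->documentElement;

# Declarations written on the root element itself, as in the snippet above.
my @declared = map { $_->name } $root->getNamespaces;
print "Declared on root: @declared\n";

# Resolve the prefix as seen from the root element.
my $uri = $root->lookupNamespaceURI('dei');
print 'dei => ', (defined $uri ? $uri : '(not bound)'), "\n";
```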
0
Shang Zhang On

Thanks to everyone who answered. I am very inexperienced at using Perl to grab data from the Internet (SEC EDGAR filings in this particular case), so I am probably not even asking the most intelligent questions.

The business problem (to the best of my understanding): 1) When a company files its 10-K/10-Q using XBRL, the SEC wants the trading symbol disclosed based on one of the SEC's published schemas. 2) The complete list of schema locations is known (and will grow):

-- http://taxonomies.xbrl.us/us-gaap/2009/non-gaap/dei-2009-01-31.xsd
-- https://xbrl.sec.gov/dei/2012/dei-2012-01-31.xsd
-- https://xbrl.sec.gov/dei/2013/dei-2013-01-31.xsd
-- https://xbrl.sec.gov/dei/2014/dei-2014-01-31.xsd

3) I want to grab such trading symbol information.

I now understand that the "dei" namespace prefix has no real significance. But it seems that even the namespace name itself, e.g. 'http://xbrl.sec.gov/dei/2012-01-31', has no significance. Only the schema location is truly meaningful. Is this correct?

My understanding is that the XBRL instance document references a schema document which "maps" the namespace (e.g. http://xbrl.sec.gov/dei/2012-01-31) to the schema location. (So the namespace-name only needs to be a unique string.)

So is there a way to modify ikegami's code to use the schema locations instead of the namespace names?

Example of a complete XBRL filing: https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664

0
kumesana On

I understand that your problem is that the XML you read will not always bind the same namespace URI to the dei: prefix and to the elements you're looking for.

In that case the XML you're stuck with is ill-designed and there is no good practice established for that. This XML is using namespaces wrong and you will need to work around that. For information, changing an element's namespace is by definition changing its name, and therefore the most basic information you're using to find it.

Your best bet is to ignore namespaces altogether. You can do that with:

//*[local-name() = "TradingSymbol"]

If the number of different namespaces you can get is limited to a select few, you could instead list them all, as dei: and dei2012: for instance, and select for both:

//dei:TradingSymbol | //dei2012:TradingSymbol
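
A rough sketch comparing the two options on hypothetical inline documents. The documents and the `AAA`/`BBB` values are made up, and the prefixes `dei2012` and `dei2014` are arbitrary names registered in the script; the two namespace URIs are the ones mentioned elsewhere on this page.

```perl
use strict;
use warnings;

use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical documents using two different versions of the dei namespace.
my @docs = (
    '<r xmlns:dei="http://xbrl.sec.gov/dei/2012-01-31"><dei:TradingSymbol>AAA</dei:TradingSymbol></r>',
    '<r xmlns:dei="http://xbrl.sec.gov/dei/2014-01-31"><dei:TradingSymbol>BBB</dei:TradingSymbol></r>',
);

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs(dei2012 => 'http://xbrl.sec.gov/dei/2012-01-31');
$xpc->registerNs(dei2014 => 'http://xbrl.sec.gov/dei/2014-01-31');

my (@by_local_name, @by_union);
for my $xml (@docs) {
    my $doc = XML::LibXML->load_xml(string => $xml);

    # Option 1: ignore the namespace and match on the local name only.
    push @by_local_name, map { $_->textContent }
        $xpc->findnodes('//*[local-name() = "TradingSymbol"]', $doc);

    # Option 2: list every supported namespace explicitly.
    push @by_union, map { $_->textContent }
        $xpc->findnodes('//dei2012:TradingSymbol | //dei2014:TradingSymbol', $doc);
}
print "local-name(): @by_local_name\n";
print "union:        @by_union\n";
```

The local-name() form also matches a TradingSymbol element from any unrelated namespace, so the explicit union is the stricter choice when the set of namespaces is known.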