The script below works. It parses an XML document and looks up a particular node in the namespace bound to the prefix "dei".
But is relying on a regex to extract the namespace definition the proper way? (I do not really know XML, so I worry that such a regex is not foolproof for all EDGAR XML files. For example, are such definitions always enclosed in double quotes and preceded by xmlns:?)
Thanks.
use strict;
use warnings;
use LWP::Simple;
use XML::LibXML;
use XML::LibXML::XPathContext;
my $url = 'https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664/acef-20161231.xml';
my $xml = LWP::Simple::get($url);
my $dom = XML::LibXML->load_xml(string => $xml);
my @nsDefs = ($xml =~ /xmlns:dei="(.+?)"/g);
die "Namespace definition must be unique!\n" unless @nsDefs == 1;
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs('dei', $nsDefs[0]);
my @matches = $xpc->findnodes('//dei:TradingSymbol');
print 'Number of matches = ', scalar(@matches), "\n";
Output:
Number of matches = 1
Never use regular expressions to process XML: your code will always be wrong. Your example has at least five bugs: it will fail to match if the document binds the namespace to a prefix other than "dei" (the prefix is arbitrary; only the namespace URI matters), it will fail to match if the attribute value is in single quotes, it will fail to match if there is whitespace around the "=" sign, it will die if the namespace declaration appears more than once (which is legal, for example on nested elements), and it will give a spurious match if there is "commented out" XML in the source document.
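To make these failure modes concrete, here is a small sketch that runs the question's regex against a handful of made-up documents. The namespace URI `http://example.com/dei` and the element name `T` are placeholders invented for the illustration, not taken from any real filing. A namespace-aware parser reports the same namespace for the element in every case, while the regex does not find exactly one declaration in most of them.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up documents; the URI and element names are placeholders.
# To an XML parser, the element <T> is in the same namespace in all of them.
my %docs = (
    double_quotes => q{<r xmlns:dei="http://example.com/dei"><dei:T/></r>},
    single_quotes => q{<r xmlns:dei='http://example.com/dei'><dei:T/></r>},
    spaced_equals => q{<r xmlns:dei = "http://example.com/dei"><dei:T/></r>},
    other_prefix  => q{<r xmlns:d="http://example.com/dei"><d:T/></r>},
    commented_out => q{<!-- <x xmlns:dei="http://example.com/old"/> -->}
                   . q{<r xmlns:dei="http://example.com/dei"><dei:T/></r>},
);

my (%regex_hits, %parsed_uri);
for my $name (sort keys %docs) {
    my $xml = $docs{$name};

    # The regex from the question, unchanged.
    my @hits = ($xml =~ /xmlns:dei="(.+?)"/g);
    $regex_hits{$name} = scalar @hits;

    # What the parser says about the namespace of the child element.
    my $root = XML::LibXML->load_xml(string => $xml)->documentElement;
    $parsed_uri{$name} = $root->firstChild->namespaceURI;

    printf "%-14s regex hits: %d  parser sees: %s\n",
        $name, $regex_hits{$name}, $parsed_uri{$name};
}
```

Only the `double_quotes` variant gives the regex exactly one hit, so the `die ... unless @nsDefs == 1` check in the question would reject the other four even though they are equivalent as far as the parser is concerned.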
It is theoretically impossible to eliminate these bugs, because regular expressions are not powerful enough to parse XML correctly.
Always use a real XML parser, and XPath.
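As a rough sketch of what that could look like with XML::LibXML: the prefix you pass to `registerNs` is your own choice and only has to match the prefix in your XPath expression, so there is no need to extract anything from the document text. The inline document and the URI `http://example.com/dei` below are stand-ins invented for the example; for a real filing you would substitute the namespace URI that the taxonomy actually uses, which this sketch assumes you know or can look up.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Inline stand-in for a downloaded filing. The URI is hypothetical.
# Note the unusual prefix, the single quotes, the spaces around "=",
# and the commented-out declaration: none of them matter to the parser.
my $xml = <<'XML';
<?xml version="1.0"?>
<!-- <old xmlns:dei="http://example.com/stale"/> -->
<xbrl xmlns:d = 'http://example.com/dei'>
  <d:TradingSymbol>ACEF</d:TradingSymbol>
</xbrl>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);

# Option 1: bind a prefix of our choosing to the expected namespace URI.
$xpc->registerNs('dei', 'http://example.com/dei');
my @matches = $xpc->findnodes('//dei:TradingSymbol');
print 'Number of matches = ', scalar(@matches), "\n";
print 'Symbol = ', $matches[0]->textContent, "\n" if @matches;

# Option 2: no prefix at all; test the local name and namespace URI in XPath.
# starts-with() is used here on the assumption that only the tail of the
# URI varies between documents.
my @by_uri = $xpc->findnodes(
    q{//*[local-name() = 'TradingSymbol'}
  . q{ and starts-with(namespace-uri(), 'http://example.com/dei')]}
);
print 'Matches by namespace-uri() = ', scalar(@by_uri), "\n";
```

Option 1 is the simpler choice when the namespace URI is fixed. Option 2 may be useful if the URI differs slightly from one document to another, since the comparison happens inside XPath on the parsed namespace rather than on the raw text.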