I somehow can't manage to extract information from AWIS results (containing Alexa data).
I have a bunch of XML files containing AWIS data, from which I want to extract pieces of information such as Rank and PageViews for the 3-month period.
The two (colliding) namespaces are confusing, and my XPath expressions are not working as intended. (Even a simple //aws:Rank/text() is not working.)
It would be great if somebody could help me get going.
Currently I am using the jdom library, but I wouldn't mind using something else. This is a tiny example that does not work as expected:
Document doc = new SAXBuilder().build(file);
XPath xpath = XPath.newInstance("//aws:Rank");
xpath.addNamespace("aws", "http://awis.amazonaws.com/doc/2005-07-11/");
Element rank = (Element) xpath.selectSingleNode(doc);
I'd prefer to use javax.xml, though...
Here's an example of the XML:
<?xml version="1.0"?>
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">
<aws:OperationRequest>
<aws:RequestId>XXXX-XXXX-XXXX-XXXX-XXXX</aws:RequestId>
</aws:OperationRequest>
<aws:UrlInfoResult>
<aws:Alexa>
<aws:ContactInfo>
<aws:DataUrl type="canonical">ahparis.com</aws:DataUrl>
<aws:PhoneNumbers>
<aws:PhoneNumber>+33 140289796</aws:PhoneNumber>
</aws:PhoneNumbers>
<aws:OwnerName>John Fay</aws:OwnerName>
<aws:Email>[email protected]</aws:Email>
<aws:PhysicalAddress>
<aws:Streets>
<aws:Street>22 rue Saint Sauveur</aws:Street>
</aws:Streets>
<aws:City>Paris 75002,</aws:City>
<aws:Country>FRANCE</aws:Country>
</aws:PhysicalAddress>
<aws:CompanyStockTicker/>
</aws:ContactInfo>
<aws:ContentData>
<aws:DataUrl type="canonical">ahparis.com</aws:DataUrl>
<aws:SiteData>
<aws:Title>Ah Paris</aws:Title>
<aws:Description>Short term apartment rentals. Search engine, descriptions, photos, rates.</aws:Description>
<aws:OnlineSince>26-Feb-2003</aws:OnlineSince>
</aws:SiteData>
<aws:Keywords>
<aws:Keyword>Français</aws:Keyword>
<aws:Keyword>Ile-de-France</aws:Keyword>
</aws:Keywords>
<aws:OwnedDomains>
<aws:OwnedDomain>
<aws:Domain>paris-tournament.org</aws:Domain>
<aws:Title>paris-tournament.org</aws:Title>
</aws:OwnedDomain>
</aws:OwnedDomains>
</aws:ContentData>
<aws:TrafficData>
<aws:DataUrl type="canonical">ahparis.com</aws:DataUrl>
<aws:Rank>2547606</aws:Rank>
<aws:RankByCountry/>
<aws:RankByCity/>
<aws:UsageStatistics>
<aws:UsageStatistic>
<aws:TimeRange>
<aws:Months>3</aws:Months>
</aws:TimeRange>
<aws:Rank>
<aws:Value>2547606</aws:Value>
<aws:Delta>-658661</aws:Delta>
</aws:Rank>
<aws:Reach>
<aws:Rank>
<aws:Value>2964984</aws:Value>
<aws:Delta>-152875</aws:Delta>
</aws:Rank>
<aws:PerMillion>
<aws:Value>0.28</aws:Value>
<aws:Delta>-10.64%</aws:Delta>
</aws:PerMillion>
</aws:Reach>
<aws:PageViews>
<aws:PerMillion>
<aws:Value>0.01</aws:Value>
<aws:Delta>+100%</aws:Delta>
</aws:PerMillion>
<aws:Rank>
<aws:Value>2143379</aws:Value>
<aws:Delta>-1628449</aws:Delta>
</aws:Rank>
<aws:PerUser>
<aws:Value>4.0</aws:Value>
<aws:Delta>+120%</aws:Delta>
</aws:PerUser>
</aws:PageViews>
</aws:UsageStatistic>
<aws:UsageStatistic>
<aws:TimeRange>
<aws:Months>1</aws:Months>
</aws:TimeRange>
<aws:Rank>
<aws:Value>1430628</aws:Value>
<aws:Delta>-3224794</aws:Delta>
</aws:Rank>
<aws:Reach>
<aws:Rank>
<aws:Value>1656655</aws:Value>
<aws:Delta>-5103474</aws:Delta>
</aws:Rank>
<aws:PerMillion>
<aws:Value>0.5</aws:Value>
<aws:Delta>+500%</aws:Delta>
</aws:PerMillion>
</aws:Reach>
<aws:PageViews>
<aws:PerMillion>
<aws:Value>0.02</aws:Value>
<aws:Delta>+100%</aws:Delta>
</aws:PerMillion>
<aws:Rank>
<aws:Value>1279227</aws:Value>
<aws:Delta>-859817</aws:Delta>
</aws:Rank>
<aws:PerUser>
<aws:Value>4</aws:Value>
<aws:Delta>-63.11%</aws:Delta>
</aws:PerUser>
</aws:PageViews>
</aws:UsageStatistic>
<aws:UsageStatistic>
<aws:TimeRange>
<aws:Days>7</aws:Days>
</aws:TimeRange>
<aws:Rank>
<aws:Value>1927968</aws:Value>
<aws:Delta>+757770</aws:Delta>
</aws:Rank>
<aws:Reach>
<aws:Rank>
<aws:Value>2942088</aws:Value>
<aws:Delta>+1612570</aws:Delta>
</aws:Rank>
<aws:PerMillion>
<aws:Value>0.3</aws:Value>
<aws:Delta>-64.64%</aws:Delta>
</aws:PerMillion>
</aws:Reach>
<aws:PageViews>
<aws:PerMillion>
<aws:Value>0.05</aws:Value>
<aws:Delta>+80%</aws:Delta>
</aws:PerMillion>
<aws:Rank>
<aws:Value>708394</aws:Value>
<aws:Delta>-413955</aws:Delta>
</aws:Rank>
<aws:PerUser>
<aws:Value>10</aws:Value>
<aws:Delta>+400%</aws:Delta>
</aws:PerUser>
</aws:PageViews>
</aws:UsageStatistic>
</aws:UsageStatistics>
<aws:ContributingSubdomains>
<aws:ContributingSubdomain>
<aws:DataUrl>ahparis.com</aws:DataUrl>
<aws:TimeRange>
<aws:Months>1</aws:Months>
</aws:TimeRange>
<aws:Reach>
<aws:Percentage>100.00%</aws:Percentage>
</aws:Reach>
<aws:PageViews>
<aws:Percentage>100.00%</aws:Percentage>
<aws:PerUser>4</aws:PerUser>
</aws:PageViews>
</aws:ContributingSubdomain>
</aws:ContributingSubdomains>
</aws:TrafficData>
</aws:Alexa>
</aws:UrlInfoResult>
<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:StatusCode>Success</aws:StatusCode>
</aws:ResponseStatus>
</aws:Response>
</aws:UrlInfoResponse>
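It looks like a typo in the namespace URI: your code has http://awis.amazonaws.com/doc/2005-07-11/ (with a trailing slash) but the document has http://awis.amazonaws.com/doc/2005-07-11 (without the slash). Namespace URIs are compared as literal strings, so the two do not match and the prefixed step selects nothing.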
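To see the effect of the trailing slash in isolation, here is a minimal sketch that uses only the JDK's DOM parser. The class name NamespaceUriMismatch, the helper countRank and the two-element XML string are made up for illustration; only the namespace URI is taken from your document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceUriMismatch {

    // Hypothetical two-element document; the URI is the one declared on <aws:Response>.
    static final String XML =
          "<aws:Response xmlns:aws=\"http://awis.amazonaws.com/doc/2005-07-11\">"
        + "<aws:Rank>2547606</aws:Rank>"
        + "</aws:Response>";

    /** Counts Rank elements whose namespace URI is exactly the given string. */
    static int countRank(String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // DocumentBuilderFactory is not namespace-aware unless asked to be.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return doc.getElementsByTagNameNS(namespaceUri, "Rank").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        // URI as written in the question's code (trailing slash): prints 0
        System.out.println(countRank("http://awis.amazonaws.com/doc/2005-07-11/"));
        // URI as declared in the document (no trailing slash): prints 1
        System.out.println(countRank("http://awis.amazonaws.com/doc/2005-07-11"));
    }
}
```

The same reasoning should apply to the jdom snippet: the string passed to addNamespace has to be character-for-character identical to the URI declared in the file.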
Namespace handling is a real pain in javax.xml.xpath, because there's no default implementation of the NamespaceContext interface provided in the Java class library. You have to either implement your own or use a third-party implementation (I usually go for the SimpleNamespaceContext from Spring).
If you're going to be doing a lot of XPath manipulation, I'd suggest looking at Saxon 9 (the HE version is free of charge) and using its s9api, as this supports the far more powerful version 2.0 of the XPath language.
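If you want to stay with plain javax.xml, a hand-rolled NamespaceContext can be quite small. The following is a rough sketch rather than a definitive implementation: the names AwisXPathSketch, SinglePrefixContext and evaluate are invented for this example, the SAMPLE string is a cut-down stand-in for your file, and the context handles only one prefix (it ignores the reserved xml and xmlns prefixes that a complete implementation should cover).

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AwisXPathSketch {

    // URI exactly as declared on <aws:Response> in the sample (no trailing slash).
    static final String AWIS_NS = "http://awis.amazonaws.com/doc/2005-07-11";

    // Cut-down stand-in for a real AWIS response; both namespace declarations are kept.
    static final String SAMPLE =
          "<aws:UrlInfoResponse xmlns:aws=\"http://alexa.amazonaws.com/doc/2005-10-05/\">"
        + "<aws:Response xmlns:aws=\"http://awis.amazonaws.com/doc/2005-07-11\">"
        + "<aws:TrafficData>"
        + "<aws:Rank>2547606</aws:Rank>"
        + "<aws:UsageStatistics>"
        + "<aws:UsageStatistic>"
        + "<aws:TimeRange><aws:Months>3</aws:Months></aws:TimeRange>"
        + "<aws:PageViews><aws:PerMillion><aws:Value>0.01</aws:Value></aws:PerMillion></aws:PageViews>"
        + "</aws:UsageStatistic>"
        + "<aws:UsageStatistic>"
        + "<aws:TimeRange><aws:Months>1</aws:Months></aws:TimeRange>"
        + "<aws:PageViews><aws:PerMillion><aws:Value>0.02</aws:Value></aws:PerMillion></aws:PageViews>"
        + "</aws:UsageStatistic>"
        + "</aws:UsageStatistics>"
        + "</aws:TrafficData>"
        + "</aws:Response>"
        + "</aws:UrlInfoResponse>";

    /** Hand-rolled NamespaceContext that knows a single prefix/URI pair. */
    static final class SinglePrefixContext implements NamespaceContext {
        private final String prefix;
        private final String uri;

        SinglePrefixContext(String prefix, String uri) {
            this.prefix = prefix;
            this.uri = uri;
        }

        @Override
        public String getNamespaceURI(String p) {
            if (p == null) throw new IllegalArgumentException("prefix must not be null");
            return prefix.equals(p) ? uri : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String u) {
            return uri.equals(u) ? prefix : null;
        }

        @Override
        public Iterator<String> getPrefixes(String u) {
            return uri.equals(u)
                    ? Collections.singletonList(prefix).iterator()
                    : Collections.<String>emptyIterator();
        }
    }

    /** Parses the XML namespace-aware and returns the string value of the first match. */
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // not the default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new SinglePrefixContext("aws", AWIS_NS));
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Overall rank: prints 2547606
        System.out.println(evaluate(SAMPLE, "//aws:TrafficData/aws:Rank"));
        // Page views per million for the 3-month range: prints 0.01
        System.out.println(evaluate(SAMPLE,
                "//aws:UsageStatistic[aws:TimeRange/aws:Months = 3]/aws:PageViews/aws:PerMillion/aws:Value"));
    }
}
```

The prefix used in the XPath expression is independent of the prefix used in the file; only the URI has to match. So if you also need elements from the outer namespace (UrlInfoResponse, ResponseStatus), you would extend the context to map a second prefix of your choosing to http://alexa.amazonaws.com/doc/2005-10-05/ and use that prefix in the expression.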