XPath expressions for extracting information from AWIS (Alexa.com) XML data

I somehow can't manage to extract information from AWIS results (containing Alexa data).

I have a bunch of XML files containing AWIS data, from which I want to extract pieces of information such as Rank and PageViews for the 3-month period.

The two (colliding) namespaces are somehow confusing and my XPath expressions are not working as intended. (Even a simple //aws:Rank/text() is not working.)

It would be great if somebody could assist me to get going.

Currently, I am using the JDOM library, but I wouldn't mind using something else. This is a tiny example that does not work as expected:

Document doc = new SAXBuilder().build(file);
XPath xpath = XPath.newInstance("//aws:Rank");
xpath.addNamespace("aws", "http://awis.amazonaws.com/doc/2005-07-11/");
Element rank = (Element) xpath.selectSingleNode(doc);

I'd prefer to use javax.xml though...

Here's an example of the XML:

<?xml version="1.0"?>
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">
<aws:OperationRequest>
<aws:RequestId>XXXX-XXXX-XXXX-XXXX-XXXX</aws:RequestId>
</aws:OperationRequest>
<aws:UrlInfoResult>
<aws:Alexa>

  <aws:ContactInfo>
    <aws:DataUrl type="canonical">ahparis.com</aws:DataUrl>
    <aws:PhoneNumbers>
      <aws:PhoneNumber>+33 140289796</aws:PhoneNumber>
    </aws:PhoneNumbers>
    <aws:OwnerName>John Fay</aws:OwnerName>
    <aws:Email>[email protected]</aws:Email>
    <aws:PhysicalAddress>
      <aws:Streets>
        <aws:Street>22 rue Saint Sauveur</aws:Street>
      </aws:Streets>
      <aws:City>Paris 75002,</aws:City>
      <aws:Country>FRANCE</aws:Country>
    </aws:PhysicalAddress>
    <aws:CompanyStockTicker/>
  </aws:ContactInfo>
  <aws:ContentData>
    <aws:DataUrl type="canonical">ahparis.com</aws:DataUrl>
    <aws:SiteData>
      <aws:Title>Ah Paris</aws:Title>
      <aws:Description>Short term apartment rentals. Search engine, descriptions, photos, rates.</aws:Description>
      <aws:OnlineSince>26-Feb-2003</aws:OnlineSince>
    </aws:SiteData>
    <aws:Keywords>
      <aws:Keyword>Français</aws:Keyword>
      <aws:Keyword>Ile-de-France</aws:Keyword>
    </aws:Keywords>
    <aws:OwnedDomains>
      <aws:OwnedDomain>
        <aws:Domain>paris-tournament.org</aws:Domain>
        <aws:Title>paris-tournament.org</aws:Title>
      </aws:OwnedDomain>
    </aws:OwnedDomains>
  </aws:ContentData>
  <aws:TrafficData>
    <aws:DataUrl type="canonical">ahparis.com</aws:DataUrl>
    <aws:Rank>2547606</aws:Rank>
    <aws:RankByCountry/>
    <aws:RankByCity/>
    <aws:UsageStatistics>
      <aws:UsageStatistic>
        <aws:TimeRange>
          <aws:Months>3</aws:Months>
        </aws:TimeRange>
        <aws:Rank>
          <aws:Value>2547606</aws:Value>
          <aws:Delta>-658661</aws:Delta>
        </aws:Rank>
        <aws:Reach>
          <aws:Rank>
            <aws:Value>2964984</aws:Value>
            <aws:Delta>-152875</aws:Delta>
          </aws:Rank>
          <aws:PerMillion>
            <aws:Value>0.28</aws:Value>
            <aws:Delta>-10.64%</aws:Delta>
          </aws:PerMillion>
        </aws:Reach>
        <aws:PageViews>
          <aws:PerMillion>
            <aws:Value>0.01</aws:Value>
            <aws:Delta>+100%</aws:Delta>
          </aws:PerMillion>
          <aws:Rank>
            <aws:Value>2143379</aws:Value>
            <aws:Delta>-1628449</aws:Delta>
          </aws:Rank>
          <aws:PerUser>
            <aws:Value>4.0</aws:Value>
            <aws:Delta>+120%</aws:Delta>
          </aws:PerUser>
        </aws:PageViews>
      </aws:UsageStatistic>
      <aws:UsageStatistic>
        <aws:TimeRange>
          <aws:Months>1</aws:Months>
        </aws:TimeRange>
        <aws:Rank>
          <aws:Value>1430628</aws:Value>
          <aws:Delta>-3224794</aws:Delta>
        </aws:Rank>
        <aws:Reach>
          <aws:Rank>
            <aws:Value>1656655</aws:Value>
            <aws:Delta>-5103474</aws:Delta>
          </aws:Rank>
          <aws:PerMillion>
            <aws:Value>0.5</aws:Value>
            <aws:Delta>+500%</aws:Delta>
          </aws:PerMillion>
        </aws:Reach>
        <aws:PageViews>
          <aws:PerMillion>
            <aws:Value>0.02</aws:Value>
            <aws:Delta>+100%</aws:Delta>
          </aws:PerMillion>
          <aws:Rank>
            <aws:Value>1279227</aws:Value>
            <aws:Delta>-859817</aws:Delta>
          </aws:Rank>
          <aws:PerUser>
            <aws:Value>4</aws:Value>
            <aws:Delta>-63.11%</aws:Delta>
          </aws:PerUser>
        </aws:PageViews>
      </aws:UsageStatistic>
      <aws:UsageStatistic>
        <aws:TimeRange>
          <aws:Days>7</aws:Days>
        </aws:TimeRange>
        <aws:Rank>
          <aws:Value>1927968</aws:Value>
          <aws:Delta>+757770</aws:Delta>
        </aws:Rank>
        <aws:Reach>
          <aws:Rank>
            <aws:Value>2942088</aws:Value>
            <aws:Delta>+1612570</aws:Delta>
          </aws:Rank>
          <aws:PerMillion>
            <aws:Value>0.3</aws:Value>
            <aws:Delta>-64.64%</aws:Delta>
          </aws:PerMillion>
        </aws:Reach>
        <aws:PageViews>
          <aws:PerMillion>
            <aws:Value>0.05</aws:Value>
            <aws:Delta>+80%</aws:Delta>
          </aws:PerMillion>
          <aws:Rank>
            <aws:Value>708394</aws:Value>
            <aws:Delta>-413955</aws:Delta>
          </aws:Rank>
          <aws:PerUser>
            <aws:Value>10</aws:Value>
            <aws:Delta>+400%</aws:Delta>
          </aws:PerUser>
        </aws:PageViews>
      </aws:UsageStatistic>
    </aws:UsageStatistics>
    <aws:ContributingSubdomains>
      <aws:ContributingSubdomain>
        <aws:DataUrl>ahparis.com</aws:DataUrl>
        <aws:TimeRange>
          <aws:Months>1</aws:Months>
        </aws:TimeRange>
        <aws:Reach>
          <aws:Percentage>100.00%</aws:Percentage>
        </aws:Reach>
        <aws:PageViews>
          <aws:Percentage>100.00%</aws:Percentage>
          <aws:PerUser>4</aws:PerUser>
        </aws:PageViews>
      </aws:ContributingSubdomain>
    </aws:ContributingSubdomains>
  </aws:TrafficData>
</aws:Alexa>
</aws:UrlInfoResult>
<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:StatusCode>Success</aws:StatusCode>
</aws:ResponseStatus>
</aws:Response>
</aws:UrlInfoResponse>

There are 3 answers

1
Ian Roberts On BEST ANSWER

It looks like a typo in the namespace URI - your code has

xpath.addNamespace("aws", "http://awis.amazonaws.com/doc/2005-07-11/");

(with a trailing slash) but the document has

xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"

(without the slash).
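Namespace URIs are compared as plain strings, character for character, so a single trailing slash is enough to make the prefix point at a different namespace. The following is a minimal sketch of that effect using only the JDK's `javax.xml.xpath`; the class name `NamespaceUriMatch`, the helper `countRanks`, and the trimmed-down XML are made up for illustration and are not part of the original question.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceUriMatch {

    // Hypothetical stand-in for the AWIS response; only the namespace URI matters here.
    static final String XML =
        "<aws:Response xmlns:aws=\"http://awis.amazonaws.com/doc/2005-07-11\">"
        + "<aws:Rank>2547606</aws:Rank>"
        + "</aws:Response>";

    /** Counts //aws:Rank matches when the XPath prefix "aws" is bound to the given URI. */
    static int countRanks(final String boundUri) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // the DOM parser ignores namespaces by default
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "aws".equals(prefix) ? boundUri : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return boundUri.equals(uri) ? "aws" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return boundUri.equals(uri)
                            ? Collections.singletonList("aws").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            Double n = (Double) xpath.evaluate("count(//aws:Rank)", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // URI with a trailing slash: a different namespace, so nothing matches.
        System.out.println(countRanks("http://awis.amazonaws.com/doc/2005-07-11/"));
        // URI exactly as declared in the document: the element is found.
        System.out.println(countRanks("http://awis.amazonaws.com/doc/2005-07-11"));
    }
}
```

The prefix itself plays no role in the comparison; only the URI it is bound to does.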

I'd prefer to use javax.xml though...

Namespace handling is a real pain in javax.xml.xpath, because there's no default implementation of the NamespaceContext interface provided in the Java class library. You have to either implement your own or use a third-party implementation (I usually go for the SimpleNamespaceContext from Spring). If you're going to be doing a lot of XPath manipulation, I'd suggest looking at Saxon 9 (the HE version is free of charge) and using its s9api, as this supports the far more powerful version 2.0 of the XPath language.
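For the "implement your own" route, a hand-rolled NamespaceContext can be quite small. The sketch below is one possible shape, not a definitive implementation: the class names `AwisXPathDemo` and `MapNamespaceContext`, the helper `evaluate`, the prefixes `alexa` and `awis`, and the shortened sample XML are all assumptions chosen for illustration. It binds the two colliding namespaces to two distinct prefixes and reads the overall Rank, the 3-month PageViews per user, and the status code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AwisXPathDemo {

    // Shortened, hypothetical version of the response from the question.
    static final String SAMPLE =
        "<aws:UrlInfoResponse xmlns:aws=\"http://alexa.amazonaws.com/doc/2005-10-05/\">"
        + "<aws:Response xmlns:aws=\"http://awis.amazonaws.com/doc/2005-07-11\">"
        + "<aws:TrafficData>"
        + "<aws:Rank>2547606</aws:Rank>"
        + "<aws:UsageStatistics><aws:UsageStatistic>"
        + "<aws:TimeRange><aws:Months>3</aws:Months></aws:TimeRange>"
        + "<aws:PageViews><aws:PerUser><aws:Value>4.0</aws:Value></aws:PerUser></aws:PageViews>"
        + "</aws:UsageStatistic></aws:UsageStatistics>"
        + "</aws:TrafficData>"
        + "<aws:ResponseStatus xmlns:aws=\"http://alexa.amazonaws.com/doc/2005-10-05/\">"
        + "<aws:StatusCode>Success</aws:StatusCode>"
        + "</aws:ResponseStatus>"
        + "</aws:Response></aws:UrlInfoResponse>";

    /** Bare-bones NamespaceContext backed by a prefix-to-URI map. */
    static final class MapNamespaceContext implements NamespaceContext {
        private final Map<String, String> uriByPrefix;

        MapNamespaceContext(Map<String, String> uriByPrefix) {
            this.uriByPrefix = uriByPrefix;
        }

        @Override public String getNamespaceURI(String prefix) {
            String uri = uriByPrefix.get(prefix);
            return uri != null ? uri : XMLConstants.NULL_NS_URI;
        }

        @Override public String getPrefix(String namespaceURI) {
            Iterator<String> it = getPrefixes(namespaceURI);
            return it.hasNext() ? it.next() : null;
        }

        @Override public Iterator<String> getPrefixes(String namespaceURI) {
            List<String> prefixes = new ArrayList<>();
            for (Map.Entry<String, String> e : uriByPrefix.entrySet()) {
                if (e.getValue().equals(namespaceURI)) {
                    prefixes.add(e.getKey());
                }
            }
            return prefixes.iterator();
        }
    }

    /** Evaluates an XPath expression against the XML and returns the string value of the result. */
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required, the default is namespace-unaware
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // One prefix per namespace URI; the prefixes need not match the document's.
            Map<String, String> ns = new HashMap<>();
            ns.put("alexa", "http://alexa.amazonaws.com/doc/2005-10-05/");
            ns.put("awis", "http://awis.amazonaws.com/doc/2005-07-11");

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new MapNamespaceContext(ns));
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(SAMPLE, "//awis:TrafficData/awis:Rank"));
        System.out.println(evaluate(SAMPLE,
            "//awis:UsageStatistic[awis:TimeRange/awis:Months = 3]"
            + "/awis:PageViews/awis:PerUser/awis:Value"));
        System.out.println(evaluate(SAMPLE, "//alexa:StatusCode"));
    }
}
```

Selecting `//awis:TrafficData/awis:Rank` rather than `//awis:Rank` avoids picking up the nested Rank elements inside UsageStatistic, Reach and PageViews. With the full file from the question, the same expressions should apply unchanged, since only the amount of surrounding content differs.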

1
rolfl On

You have a typo in your code. You have:

xpath.addNamespace("aws", "http://aws.amazonaws.com/doc/2005-07-11/");

but you should have:

xpath.addNamespace("aws", "http://awis.amazonaws.com/doc/2005-07-11/");

(note the change from aws to awis).

Additionally, you should really be using JDOM 2.x and the new XPath API that was introduced there. The JDOM 2.x versions have significantly better namespace handling and use generics for the resulting content. See "The changes in JDOM2.x XPath handling".

0
Joel M. Lamsen On

I tried this using your input with xslt with the following stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:alex="http://alexa.amazonaws.com/doc/2005-10-05/"
    xmlns:awis="http://awis.amazonaws.com/doc/2005-07-11"
    version="1.0">

    <xsl:output omit-xml-declaration="yes"/>

    <xsl:template match="/">
        <xsl:value-of select="//awis:Rank/text()"/>
    </xsl:template>

</xsl:stylesheet>

and somehow I got an output of:

2547606

I suppose you have to register the two namespaces under different prefixes, then use those prefixes in your XPath.