Need some help with XPath expression. One works, the other doesn't


I'm using the COBRA HTMLParser but haven't had luck parsing one particular tag. Here's the source:

<li id="eta" class="hentry">
  <span class="body">
    <span class="actions">
    </span>
    <span class="content">
    </span>
    <span class="meta entry">Content here
    </span>
    <span class="meta entry stub">Content here
    <span class="shared-content">
      Information by
      <a class="title" data="associate" href="/associate">Associate</a>
    </span>
    </span>
  </span>
</li>

I am able to use the following XPaths to get the proper information:

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodeList = (NodeList) xpath.evaluate("//span[contains(@class, 'body')]", document, XPathConstants.NODESET);
            int length = nodeList.getLength();
            System.out.println(nodeList.getLength());
            for(int i = 0; i < length; i++) {
                Element element = (Element) nodeList.item(i);
                NodeList n = null;
                try {
                    n = (NodeList) xpath.evaluate("span[contains(@class, 'content')]", element, XPathConstants.NODESET);
                    String body = n.item(0).getTextContent();
                    System.out.println("Content: " + body);
                } catch (Exception e) {};

                try {

                    String date = (String) xpath.evaluate("span[contains(@class, 'meta entry')]/a/span/@data", element, XPathConstants.STRING);
                    System.out.println("DATA: " + date);

                    String source = (String) xpath.evaluate("//span[contains(@class, 'meta entry')]/span", element, XPathConstants.STRING);
                    System.out.println("DATA: " + source);

                } catch (Exception e) {};

                //This does not work at all! I've tried every combination and still can't get it to run
                try {
                    String info = (String) xpath.evaluate("//span[@class='shared-content']/a/@data", element, XPathConstants.STRING);
                    System.out.println("INFO: " + info);
                } catch (Exception e) {};

            }

The last expression does not work no matter which combination I try. I've also tried the following, but neither helps:

        String info = (String) xpath.evaluate("//span[contains(@class, 'shared-content')]/a/@data", element, XPathConstants.STRING);
        String info = (String) xpath.evaluate("//span[contains(@class, 'meta entry info')]/span/a/@data", element, XPathConstants.STRING);

Any suggestions?

EDIT: A couple of people have suggested that the XML is illegal. Honestly, I'm not sure why it would be illegal, because I've seen this pattern almost everywhere so far. In any case, I don't have control over the XML (at least not until Monday, when my other pals get back). I am trying to assess the feasibility of writing a mashup that includes this information. Is there some way to disable the checking?

Here's the XML that was parsed:

       <?xml version="1.0" encoding="UTF-8"?>
          <span class="body">
            <span class="content">TextContent</span>
            <span class="meta entry">TextContent</span>

          </span>

I guess the document is not getting parsed correctly.


There are 4 answers

jutky

@Jherico, @Andrew Keith: I don't know the COBRA HTMLParser, but mixing #PCDATA with child elements (mixed content) is legal XML.
It could be declared like this in a DTD:

<!ELEMENT text_node     (#PCDATA|i|b|u)*>

This is why well-formed HTML can still be legal XML.
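To check this against a real parser, here is a minimal sketch using the JDK's standard DOM parser (not COBRA). The class name MixedContentDemo, the helper describeChildren, and the shortened fragment are made up for illustration. The fragment mirrors the "meta entry stub" span from the question, which holds both text and a nested span.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MixedContentDemo {

    // Parses the XML string and lists the direct children of the root
    // element, labelling each one as a text node or an element node.
    public static String describeChildren(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();

        StringBuilder sb = new StringBuilder();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append("text(").append(child.getNodeValue().trim()).append(") ");
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                sb.append("element(").append(child.getNodeName()).append(") ");
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, shortened version of the span from the question.
        String xml = "<span class=\"meta entry stub\">Content here"
                   + "<span class=\"shared-content\">Information by</span>"
                   + "</span>";
        // prints: text(Content here) element(span)
        System.out.println(describeChildren(xml));
    }
}
```

The parser accepts the fragment without complaint and reports one text node followed by one element node, so the mixed content by itself should not be what breaks the parse.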

Jherico

I ran the following code

public static void main(String[] args) throws SAXException, IOException, ParserConfigurationException, XPathExpressionException {
    Document doc = XmlUtil.parseXmlResource("/temp.xml");
    for (Node n : XPathUtil.getNodes(doc, "//span[contains(@class, 'body')]")) {
        System.out.println(XPathUtil.getStringValue(doc, "//span[@class='shared-content']/a/@data"));
    }
}

And it output 'associate'. I think your XPath is fine. What is happening instead? And can you remove the empty catch blocks so we can see if you're actually getting exceptions?

Note, XmlUtil and XPathUtil are my own personal convenience functions to eliminate most of the XPath and XML boilerplate code.
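Since XmlUtil and XPathUtil are not shown, a rough equivalent using only the standard JAXP classes might look like the following. This is a sketch under assumptions, not the code that was actually run: the class name SharedContentXPath, the evaluate helper, and the two inline strings (the markup from the question and the parsed XML from the question's edit, retyped as string literals) are all made up for illustration. It also prints the stack trace instead of swallowing exceptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SharedContentXPath {

    // The markup from the question, retyped as one well-formed XML string.
    public static final String SOURCE =
        "<li id=\"eta\" class=\"hentry\">"
      + "<span class=\"body\">"
      + "<span class=\"actions\"></span>"
      + "<span class=\"content\"></span>"
      + "<span class=\"meta entry\">Content here</span>"
      + "<span class=\"meta entry stub\">Content here"
      + "<span class=\"shared-content\">Information by "
      + "<a class=\"title\" data=\"associate\" href=\"/associate\">Associate</a>"
      + "</span></span></span></li>";

    // The XML the asker reports was actually parsed (from the edit).
    public static final String PARSED =
        "<span class=\"body\">"
      + "<span class=\"content\">TextContent</span>"
      + "<span class=\"meta entry\">TextContent</span>"
      + "</span>";

    // Parses the XML and evaluates the expression against the document,
    // returning the XPath string value of the result.
    public static String evaluate(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (String) xpath.evaluate(expression, doc, XPathConstants.STRING);
    }

    public static void main(String[] args) {
        String expr = "//span[@class='shared-content']/a/@data";
        try {
            // prints: source: [associate]
            System.out.println("source: [" + evaluate(SOURCE, expr) + "]");
            // prints: parsed: []
            System.out.println("parsed: [" + evaluate(PARSED, expr) + "]");
        } catch (Exception e) {
            // Unlike an empty catch block, this shows what went wrong.
            e.printStackTrace();
        }
    }
}
```

Against the original markup the expression yields 'associate'. Against the parsed XML from the edit it yields an empty string and no exception, which is what you would see if the nested span never made it into the DOM.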

Cheeso

XPathVisualizer is a nice tool that runs on Windows and lets you see the results of your XPath queries. It's an xcopy install (a single EXE file), and it's free.

I took it and ran your query in it, got this result:

[screenshot of the XPathVisualizer result; image not preserved]

jitter

I just ran your code sample as is (copy and paste) and got the output below, so everything seems fine. Which Cobra version are you using? I'm on 0.98.4.

1
Content:

DATA:
DATA:
      Information by
      Associate

INFO: associate

Reproducible test(?)

  • Used javac/java version 1.6.0_16 (HotSpot Client: build 14.2-b01, mixed mode, sharing)
  • Downloaded 0.98.4 (cobra-0.98.4.zip) from SourceForge (Cobra HTML Toolkit download)
  • Extracted js.jar and cobra.jar from cobra-0.98.4.zip:\lib to a directory XXX
  • Wrote XMLTest.java and HTMLTest.java in the same directory (the file names originally linked to the source)
  • Compiled with (Windows): javac -cp .;cobra.jar;js.jar *.java
  • Then executed like this (output included)

XMLTest

java -cp .;cobra.jar;js.jar XMLTest 1

XMLTest Output:

1
Content:

DATA:
DATA:
      Information by
      Associate

INFO: associate 

HTMLTest

java -cp .;cobra.jar;js.jar HTMLTest 1

HTMLTest Output:

1
Content:

DATA:
DATA:
      Information by
      Associate

INFO: associate
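A side note on the expressions in the question: several of them start with `//` even though they are evaluated with `element` as the context node. In XPath 1.0 a path that begins with `/` or `//` is absolute, so it searches from the root of the document that contains the context node, not from the node itself. With a single body span that makes no difference, but with several it does. The sketch below uses plain JAXP; the class name RelativeXPathDemo, the perBody helper, and the two-item sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelativeXPathDemo {

    // Hypothetical sample with two body spans, each holding its own link.
    public static final String XML =
        "<ul>"
      + "<li><span class=\"body\"><a data=\"first\"/></span></li>"
      + "<li><span class=\"body\"><a data=\"second\"/></span></li>"
      + "</ul>";

    // Evaluates the expression once per body span, using that span as the
    // context node, and collects the string results in document order.
    public static List<String> perBody(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList bodies = (NodeList) xpath.evaluate(
            "//span[@class='body']", doc, XPathConstants.NODESET);

        List<String> results = new ArrayList<String>();
        for (int i = 0; i < bodies.getLength(); i++) {
            results.add(xpath.evaluate(expression, bodies.item(i)));
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Absolute path: ignores the context node, always finds the first link.
        System.out.println(perBody("//a/@data"));   // prints [first, first]
        // Relative path: searches only below the current body span.
        System.out.println(perBody(".//a/@data"));  // prints [first, second]
    }
}
```

If the intent is to read values per body span, prefixing the inner expressions with `.` (for example `.//span[@class='shared-content']/a/@data`) keeps the search inside the current element.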