Getting an exception when evaluating an XPath expression in Java

I am trying to learn how to use XPath expressions in Java. I am using JTidy to convert an HTML page to XHTML so that I can query it with XPath expressions. I have the following code:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();

// Convert the page to XHTML with JTidy and get the resulting DOM
Document doc = ConvertXHTML("https://twitter.com/?lang=fr");

// Create the XPath evaluator
XPathFactory xpathfactory = XPathFactory.newInstance();
XPath Inst = xpathfactory.newXPath();
NodeList nodes = (NodeList) Inst.evaluate("//p/@align", doc, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
    // "//p/@align" selects attribute nodes, so treat each item as a Node
    // (casting it to Element would throw a ClassCastException)
    Node n = nodes.item(i);
    System.out.println(n.getNodeValue());
}

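public Document ConvertXHTML(String link) {
    try {
        URL u = new URL(link);

        // try-with-resources closes both streams once JTidy has finished
        try (BufferedInputStream instream = new BufferedInputStream(u.openStream());
             FileOutputStream outstream = new FileOutputStream("out.xhtml")) {

            Tidy c = new Tidy();
            c.setShowWarnings(false);
            c.setInputEncoding("UTF-8");
            c.setOutputEncoding("UTF-8");
            c.setXHTML(true);

            // Parses the page, writes the XHTML to out.xhtml and returns JTidy's DOM
            return c.parseDOM(instream, outstream);
        }
    } catch (IOException e) {
        // The posted snippet is cut off before its catch block;
        // rethrowing here is an assumption, not the original handling.
        throw new RuntimeException("Could not convert " + link, e);
    }
}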

It works fine for most URLs, but not for this one:

https://twitter.com/?lang=fr

For that URL I get this exception:

javax.xml.transform.TransformerException: Index -1 out of bounds.....

Below is part of the stack trace:

javax.xml.transform.TransformerException: Index -1 out of bounds for length 128
at java.xml/com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:366)
at java.xml/com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:303)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathImplUtil.eval(XPathImplUtil.java:101)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.eval(XPathExpressionImpl.java:80)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:89)
at files.ExampleCode.GetThoselinks(ExampleCode.java:50)
at files.ExampleCode.DoSomething(ExampleCode.java:113)
at files.ExampleCode.GetThoselinks(ExampleCode.java:81)
at files.ExampleCode.DoSomething(ExampleCode.java:113)

I am not sure whether the problem is in the converted XHTML of the website or somewhere else. Can anyone tell me what is wrong with the code? Any corrections would be helpful.

There are 2 answers

Michael Kay

I would normally say that an index-out-of-bounds exception coming from deep within the XPath engine is a bug in the XPath engine. The only caveat is if there's something structurally wrong with the DOM that the XPath engine is searching; an XPath processor is entitled to make reasonable assumptions that the DOM is valid and to crash if it isn't. In that case it would be a bug in Tidy, which created the DOM.
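
One rough way to probe the second possibility is to walk the DOM and check that its parent and sibling links agree with each other. The sketch below is only an illustration: the names `DomSanityCheck`, `countInconsistencies` and `parseWithJdk` are made up for this example, and the checks are an assumption about the kind of invariant an XPath engine might rely on, not a description of what the JDK engine actually requires. It compares nodes by object identity, which assumes the DOM implementation returns the same object for the same node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomSanityCheck {

    // Recursively counts places where parent/sibling links disagree with the child list.
    public static int countInconsistencies(Node parent) {
        int problems = 0;
        Node previous = null;
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getParentNode() != parent) {
                problems++; // child does not point back to the node that lists it
            }
            if (child.getPreviousSibling() != previous) {
                problems++; // backward link does not mirror the forward link
            }
            problems += countInconsistencies(child);
            previous = child;
        }
        if (parent.getLastChild() != previous) {
            problems++; // last child differs from the end of the sibling chain
        }
        return problems;
    }

    // Hypothetical helper: builds a reference DOM with the JDK parser.
    public static Document parseWithJdk(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in document so the demo runs without JTidy on the classpath.
        Document doc = parseWithJdk("<html><body><p align=\"center\">a</p><p>b</p></body></html>");
        System.out.println("Inconsistencies found: " + countInconsistencies(doc));
    }
}
```

In practice you would pass the Document returned by Tidy's `parseDOM` instead of the stand-in. A non-zero count would point towards the DOM rather than the XPath engine, but a count of zero does not prove the DOM is sound, since other invariants may be broken.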

user3969107

I had a similar problem using XPath evaluation on a document produced by JTidy. I got around it by having JTidy serialize the DOM it produced to a file, and then parsing that XML file with javax.xml.parsers.DocumentBuilder to get a second version of the DOM. Bizarre as it seems, using the second one avoided the out-of-bounds exception and worked. Use code like the following:

        // Here `tidy` is a configured Tidy instance and `input` is the InputStream of the HTML page.
        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        documentBuilderFactory.setNamespaceAware(true);
        // If you don't do the following, it will take a full minute to parse the XML document (presumably the time-out
        // period for trying to load the DTD). See https://stackoverflow.com/questions/6204827/xml-parsing-too-slow.
        documentBuilderFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();

        // First DOM: built by JTidy (the one that triggers the exception)
        Document doc = tidy.parseDOM(input, null);

        // Serialize JTidy's DOM to a file; try-with-resources closes the stream
        try (FileOutputStream fos = new FileOutputStream("temp.xml")) {
            tidy.pprint(doc, fos);
        }

        // Second DOM: built by the JDK parser from the serialized file
        doc = documentBuilder.parse(new File("temp.xml"));
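
To show the second half of that round trip on its own, here is a minimal sketch that re-parses serialized XHTML with the JDK parser and then runs the `//p/@align` query from the question. The class and method names (`ReparseSketch`, `reparse`, `alignValues`) and the sample markup are made up for the example; the sample string merely stands in for what `tidy.pprint` would write, and the bytes could just as well come from a `ByteArrayOutputStream` passed to `pprint` instead of a temp file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class ReparseSketch {

    // Re-parses serialized XHTML with the JDK parser to obtain a standard DOM.
    public static Document reparse(byte[] xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Skip fetching the external XHTML DTD, as in the answer above.
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            // Namespace awareness is left off here so that an unprefixed "//p" matches.
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(xhtml));
        } catch (Exception e) {
            throw new IllegalStateException("Could not re-parse the serialized XHTML", e);
        }
    }

    // Collects the values of all align attributes found on <p> elements.
    public static List<String> alignValues(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList attrs = (NodeList) xpath.evaluate("//p/@align", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                // Each item is an attribute node, so read its value rather than casting to Element.
                values.add(attrs.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the XHTML that tidy.pprint(doc, out) would produce.
        String sample =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>"
            + "<body><p align=\"center\">a</p><p>b</p><p align=\"right\">c</p></body></html>";
        Document doc = reparse(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(alignValues(doc));
    }
}
```

One design choice to be aware of: the sketch leaves the parser namespace-unaware. With `setNamespaceAware(true)` and an `xmlns` declaration on the `html` element, the elements are in the XHTML namespace, and an unprefixed name test such as `//p` in XPath 1.0 will not match them unless you bind a prefix through a `NamespaceContext`.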