I use HtmlCleaner 2.6.1 and XPath to parse an HTML page in an Android application. Here are the HTML pages:
http://www.kino-govno.com/comments/42571-postery-kapitan-fillips-i-poslednij-rubezh
http://www.kino-govno.com/comments/42592-fantasticheskie-idei-i-mesta-ih-obitanija
The first link returns a document and everything is fine. For the second link, this call:
document = domSerializer.createDOM(tagNode);
returns nothing.
If I create a plain Java project without Android, everything works fine.
Here is the code:
String queries = "//div[starts-with(@class, 'news_text op')]/p";
URL url = new URL(link2);
// Clean the remote page into HtmlCleaner's own node tree
TagNode tagNode = new HtmlCleaner().clean(url);
CleanerProperties cleanerProperties = new CleanerProperties();
DomSerializer domSerializer = new DomSerializer(cleanerProperties);
// Convert the tree to a W3C DOM document so that XPath can be evaluated on it
Document document = domSerializer.createDOM(tagNode);
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList pageNode = (NodeList) xPath.evaluate(queries, document, XPathConstants.NODESET);
String val = pageNode.item(0).getFirstChild().getNodeValue();
That's because HtmlCleaner wraps the paragraphs of the second HTML page into another <div/>, so they are not direct children any more. Use the descendant-or-self axis (//) instead of the child axis (/):
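As a rough illustration of the difference, here is a minimal sketch that uses only the JDK's DOM and XPath classes, without HtmlCleaner. The class name AxisDemo and the sample markup are assumptions made for this example: the markup is not the real page, only a hypothetical fragment that mimics the extra wrapper <div/> around the paragraphs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AxisDemo {

    // Hypothetical markup imitating the cleaned second page: the <p> elements
    // sit inside an extra wrapper <div>, not directly under the news_text div.
    static final String WRAPPED =
        "<html><body><div class=\"news_text op\">"
        + "<div><p>first</p><p>second</p></div>"
        + "</div></body></html>";

    // Parses the markup and returns how many nodes the XPath query selects.
    static int count(String xml, String query) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                (NodeList) xPath.evaluate(query, document, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            // Checked parser/XPath exceptions are wrapped to keep the demo short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Child axis: only <p> elements directly under the matching div.
        String childAxis = "//div[starts-with(@class, 'news_text op')]/p";
        // Descendant-or-self axis: <p> elements at any depth below that div.
        String descendantAxis = "//div[starts-with(@class, 'news_text op')]//p";

        System.out.println("child axis: " + count(WRAPPED, childAxis));           // prints 0
        System.out.println("descendant axis: " + count(WRAPPED, descendantAxis)); // prints 2
    }
}
```

In the Android code from the question, the same change would mean ending the query with //p instead of /p. It is also worth checking pageNode.getLength() before calling item(0), since an empty result would otherwise cause a NullPointerException.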