Cannot get all matched nodes while using htmlparser to parse a website


I'm using htmlparser to parse a website, but I've run into a really weird problem. I'm trying to get all <li> nodes on a webpage, and my code is as follows:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

String url = "http://s.1688.com/selloffer/offer_search.htm?keywords=%BD%A8%B2%C4&n=y&categoryId=";
Parser parser = new Parser(url);
parser.setEncoding("gb2312");

NodeList list = parser.extractAllNodesThatMatch(new TagNameFilter("li"));
// NodeList list = parser.parse(new CssSelectorNodeFilter("li[class=\"sm-offerShopwindow\"]"));
System.out.print(list.size() + "\n");
for (int i = 0; i < list.size(); i++) {
    Node li = list.elementAt(i);
    System.out.print("text:" + li.getText() + "\n");
}

But the printed list size is always 20. It seems that it doesn't traverse all the <li> nodes on that page. Why? Thanks for any advice.
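One way to narrow this down is to count how many <li tags the raw HTML actually contains, independent of any parser. Here is a minimal diagnostic sketch (the class name and the crude counting approach are my own illustration, not from the original post; readAllBytes() needs Java 9+):

import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;

public class RawLiCount {
    public static void main(String[] args) throws Exception {
        String url = "http://s.1688.com/selloffer/offer_search.htm?keywords=%BD%A8%B2%C4&n=y&categoryId=";
        try (InputStream in = new URL(url).openStream()) {
            // The page declares gb2312, so decode the bytes with that charset.
            String html = new String(in.readAllBytes(), Charset.forName("GB2312"));
            int count = 0;
            int idx = html.indexOf("<li");
            while (idx != -1) {
                // Only count "<li>" / "<li ...", not tags like <link>.
                char next = idx + 3 < html.length() ? html.charAt(idx + 3) : ' ';
                if (next == '>' || Character.isWhitespace(next)) count++;
                idx = html.indexOf("<li", idx + 3);
            }
            System.out.println("raw <li tags: " + count);
        }
    }
}

If this also reports 20, the page itself only ships 20 list items in its static HTML and the parser is not at fault.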


1 Answer

Harald

Even the top browsers do not always agree on how to parse all the weird stuff out there pretending to be HTML, and the web has developed a great deal since 2006. So I would not be surprised if such an old piece of software cannot cope with modern HTML.
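If the old parser really is the problem, a more modern, fault-tolerant library is worth trying. A minimal sketch using jsoup (my suggestion, assuming it is on the classpath; not part of the original answer):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLiDemo {
    public static void main(String[] args) throws Exception {
        String url = "http://s.1688.com/selloffer/offer_search.htm?keywords=%BD%A8%B2%C4&n=y&categoryId=";
        // jsoup detects the charset from the HTTP headers or a <meta> tag,
        // so gb2312 normally does not have to be set by hand.
        Document doc = Jsoup.connect(url).get();
        System.out.println(doc.select("li").size());
        for (Element li : doc.select("li")) {
            System.out.println("text: " + li.text());
        }
        // The commented-out CSS filter from the question maps directly to:
        // doc.select("li.sm-offerShopwindow")
    }
}

If jsoup also finds only 20 items, the remaining entries are most likely loaded by JavaScript after the initial page load, which no pure HTML parser will see.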