Parsing using HTMLParser

Parser parser = new Parser();
parser.setInputHTML("d:/index.html");
parser.setEncoding("UTF-8");
NodeList nl = parser.parse(null);
/*
SimpleNodeIterator sNI = list.elements();
while (sNI.hasMoreNodes()) {
    System.out.println(sNI.nextNode().getText());
}
*/
NodeList trs = nl.extractAllNodesThatMatch(new TagNameFilter("tr"), true);
for (int i = 0; i < trs.size(); i++) {
    NodeList nodes = trs.elementAt(i).getChildren();
    NodeList tds = nodes.extractAllNodesThatMatch(new TagNameFilter("td"), true);
    System.out.println(tds.toString());
}

I am not getting any output; Eclipse just shows javaw.exe as terminated.

1 Answer

Sahil Muthoo

Pass the path (or URL) of the resource to the Parser constructor.

Parser parser = new Parser("index.html");

Parse and print all the divs on this page:

Parser parser = new Parser("http://stackoverflow.com/questions/7293729/parsing-using-htmlparser/");
parser.setEncoding("UTF-8");
NodeList nl = parser.parse(null);
NodeList div = nl.extractAllNodesThatMatch(new TagNameFilter("div"),true);
System.out.println(div.toString());

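parser.setInputHTML(String inputHTML) doesn't do what you think it does. It treats its argument as the HTML markup to parse, not as the path to a file. In your code the parser therefore parsed the literal text "d:/index.html", found no tr tags in it, and the loop body never ran, which is why nothing was printed. To point the parser at an HTML resource (a file or a URL), use the constructor.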

Example:

Parser parser = new Parser();
parser.setInputHTML("<div>Foo</div><div>Bar</div>");
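
To make the "no output" symptom concrete, here is a small library-free sketch. It is not how HTMLParser works internally: the class InputVsPath and the helper countRows are hypothetical names, and the regex is only a rough stand-in for a tr tag filter. It just shows that a path string, when read as markup, contains no rows to iterate over, while real markup does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InputVsPath {
    // Rough stand-in for a "tr" tag filter: matches <tr> or <tr ...attributes>.
    // Assumption for illustration only; HTMLParser does not use this regex.
    private static final Pattern TR_OPEN =
            Pattern.compile("<tr(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: counts opening tr tags in a string treated as markup.
    public static int countRows(String markup) {
        Matcher m = TR_OPEN.matcher(markup);
        int rows = 0;
        while (m.find()) {
            rows++;
        }
        return rows;
    }

    public static void main(String[] args) {
        // A file path handed over as if it were markup: there are no tags in it.
        System.out.println(countRows("d:/index.html"));
        // Actual markup: two rows are found.
        System.out.println(countRows(
                "<table><tr><td>Foo</td></tr><tr><td>Bar</td></tr></table>"));
    }
}
```

The first call prints 0 and the second prints 2. The same thing happens in the question: with zero matching rows, the for loop has nothing to print.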