how to parse the html when using crawler4j

2.4k views Asked by At

Recently,I had to crawl some website with open Source project crawler4j.However,crawler4j didn't offer any api for using.Now,i came to a problem that how i can parse a html with the function and class provided by crawler4j and find element like we do with jquery

1

There are 1 answers

0
vigneshwerv On

It's relatively simple. The following approach worked for me.

In MyCrawler.java:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
public void visit(Page page) {
...
if (page.getParseData() instanceof HtmlParseData) {
                    HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                    String html = htmlParseData.getHtml();
                    Document doc = Jsoup.parseBodyFragment(html);
...