Cannot download full Document using HtmlUnit and Jsoup combination (using Java)


Problem statement: I want to crawl this page: http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0

Let's say I want to parse the address, that is, "24, Middle Gap Road, The Peak, Hong Kong".

What I did: I first tried loading the page with Jsoup alone, but noticed that the page takes some time to load. So I also plugged in HtmlUnit to wait for the page to finish loading first.

Code I wrote:

public static void parseByHtmlUnit() throws Exception{
        String url = "http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0";
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38);
        webClient.waitForBackgroundJavaScriptStartingBefore(30000);
        HtmlPage page = webClient.getPage(url);
        synchronized(page) {
            page.wait(30000);
        }
        try {
            Document doc = Jsoup.parse(page.asXml());
            String address = ElementsUtil.getTextOrEmpty(doc.select(".addr"));
            System.out.println("address"+address);
        } catch (Exception e) {
             e.printStackTrace();
        }
}

Expected output: the console should print: address 24, Middle Gap Road, The Peak, Hong Kong

Actual output: address


There is 1 answer

Answer by user3707125 (best answer)

How about this?

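// Fetch and parse the page with Jsoup alone; the second argument is the timeout in milliseconds.
final Document document = Jsoup.parse(
    new URL("http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0"),
    30000
);
// Print the combined text of every element with the class "addr".
System.out.println(document.select(".addr").text());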
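
For reference, here is a slightly fuller sketch of the same Jsoup-only idea. It assumes, as the snippet above does, that the address is present in the HTML the server returns and does not depend on JavaScript. The class name AddressExtractor, its two methods, and the user-agent string are made up for illustration; the ".addr" selector and the 30-second timeout come from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

// Hypothetical helper class, not part of Jsoup or of the original question.
final class AddressExtractor {

    // Returns the text of the first element with class "addr",
    // or an empty string if the document has no such element.
    static String extractAddress(Document doc) {
        Element addr = doc.select(".addr").first();
        return addr == null ? "" : addr.text();
    }

    // Fetches the page with Jsoup alone and extracts the address from it.
    static String fetchAddress(String url) throws IOException {
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // assumption: some sites reject the default agent
                .timeout(30_000)          // same 30-second limit as above, in milliseconds
                .get();
        return extractAddress(doc);
    }
}
```

Calling AddressExtractor.fetchAddress(url) with the listing URL from the question and printing "address " + result should then give the line the question expects, provided the page still serves the address in its static markup. Keeping the extraction in its own method means it can also be checked against a small HTML string without touching the network.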