I am trying to extract data for a class project from a webpage (a page that shows search results). Specifically, it's this page:
I just want to extract the titles of the products.
I'm using the following code:
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
final HtmlPage page = webClient.getPage(itemPageURL);
int tries = 20; // Amount of tries to avoid infinite loop
while (tries > 0) {
    tries--;
    synchronized (page) {
        page.wait(2000); // How often to check
    }
}
int numThreads = webClient.waitForBackgroundJavaScript(1000000L);
PrintWriter pw = new PrintWriter("test-target-search.txt");
pw.println(page.asXml());
pw.close();
The resulting page does not contain the product information that is shown in the web browser. I imagine the AJAX calls haven't completed? (Not sure, though.)
Any help would be greatly appreciated. Thanks!
You can use plain GET requests for this task. Control which page you get with the "pageCount" and "offset" arguments in the URL. After retrieving the page (the example below does this for one page), you can use a regex, or a parser for whatever format the content is in (JSON?), to extract the titles.
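A rough sketch of that approach in Java 11+, not the site's actual API: the base URL, the class name SearchPageFetcher, and the assumption that the response is JSON with a "title" field per product are all hypothetical, so check the real response in your browser's network tab first. Only the "pageCount" and "offset" parameter names come from the URL mentioned above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SearchPageFetcher {
    // Assumption: each product appears as "title":"..." in a JSON response.
    // The capture group keeps JSON escapes (e.g. \") undecoded.
    private static final Pattern TITLE =
            Pattern.compile("\"title\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Appends the paging arguments to a (hypothetical) search URL.
    static String buildUrl(String baseUrl, int pageCount, int offset) {
        String separator = baseUrl.contains("?") ? "&" : "?";
        return baseUrl + separator + "pageCount=" + pageCount + "&offset=" + offset;
    }

    // Issues a plain GET request and returns the response body as text.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Collects every title matched by the pattern, in document order.
    static List<String> extractTitles(String body) {
        List<String> titles = new ArrayList<>();
        Matcher matcher = TITLE.matcher(body);
        while (matcher.find()) {
            titles.add(matcher.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up sample body so the sketch runs without network access.
        String sample = "{\"items\":[{\"title\":\"Red Mug\"},{\"title\":\"Blue Mug\"}]}";
        System.out.println(extractTitles(sample));
        System.out.println(buildUrl("https://example.com/search", 24, 0));
    }
}
```

To walk through several pages, you would call fetch(buildUrl(base, pageCount, offset)) in a loop, increasing offset by pageCount each time and stopping when extractTitles returns an empty list. If the content really is JSON, a proper JSON parser is a safer choice than the regex.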