Make NodeJS/JSDom wait for full rendering before scraping


I'm trying to scrape data from a website that requires a login. Unfortunately, I get different results with JSDom/NodeJS than I do in a web browser such as Firefox. In particular, I'm not getting the login form with the username field, password field and submit button.

I understand that much of JavaScript is asynchronous. However, I thought JSDom's "done" callback was only called once the page had been fully rendered. What I'd like to do is simulate an HTTPS GET and wait until everything that runs on document.ready has finished.

var jsdom = require("jsdom");
var jsdom_global = require("jsdom-global");
var fs = require("fs");
var jquery = fs.readFileSync("./jquery-3.1.1.min.js", "utf-8");

jsdom.env({
  url: "https://wemc.smarthub.coop/Login.html#login:",
  src: [jquery],
  done: function (err, window) {
    // Bail out early if the page could not be loaded.
    if (err) {
      console.error(err);
      return;
    }
    var $ = window.$;
    if ($("button#LoginSubmitButton").length) {
      console.log('Click button found');
    } else {
      console.log('Click button not found');
    }
    // The following text boxes are not coming back:
    // $("input#LoginUsernameTextBox")
    // $("input#LoginPasswordTextBox")

    // If I enable the line below, I see a lot less than I would if I
    // do a view source in any reasonable browser.
    //console.log($("body").html());
  }
});

There is 1 answer

Answered by Pyx

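Usually, this happens because JSDOM isn't executing the page's JavaScript when it loads the page. In that case, the only elements you get back are the ones in the server-rendered HTML; anything the page builds client-side, such as a login form, will be missing.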

You could try a headless browser such as PhantomJS and see whether that works for you. The README on the JSDOM GitHub page has a section near the bottom that explains the distinction between the two.
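If you would rather stay with JSDOM, one possible approach is to enable script execution and then poll until the element you need shows up. The following is only a rough sketch, not a tested solution for this particular site: it assumes a newer jsdom release (v10 or later) where JSDOM.fromURL accepts the runScripts and resources options, and the helper names waitFor and findLoginButton are made up for illustration.

```javascript
// Poll until predicate() returns a truthy value; reject once timeoutMs has elapsed.
// waitFor is a hypothetical helper, not part of jsdom.
function waitFor(predicate, { timeoutMs = 5000, intervalMs = 100 } = {}) {
  return new Promise((resolve, reject) => {
    const started = Date.now();
    (function check() {
      let result;
      try {
        result = predicate();
      } catch (e) {
        return reject(e);
      }
      if (result) return resolve(result);
      if (Date.now() - started >= timeoutMs) {
        return reject(new Error("Timed out waiting for condition"));
      }
      setTimeout(check, intervalMs);
    })();
  });
}

// Assumed usage with jsdom v10+ (requires `npm install jsdom`).
// Resolves to true if the button appears, false if the wait times out.
async function findLoginButton(url) {
  const { JSDOM } = require("jsdom"); // loaded lazily so waitFor works without jsdom
  const dom = await JSDOM.fromURL(url, {
    runScripts: "dangerously", // execute the page's own scripts
    resources: "usable",       // also fetch external <script> files
  });
  try {
    await waitFor(() =>
      dom.window.document.querySelector("button#LoginSubmitButton")
    );
    return true;
  } catch (e) {
    return false;
  } finally {
    dom.window.close(); // stop timers started by the page
  }
}
```

Calling findLoginButton("https://wemc.smarthub.coop/Login.html#login:") would then report whether the button ever appeared. Keep in mind that runScripts: "dangerously" runs the remote site's code inside your Node process, and that JSDOM does not implement everything a real browser does, so a headless browser may still be the more reliable option.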