Retrieve HTML content of a page several seconds after it's loaded


I'm writing a Node.js script to automatically retrieve data from an online directory. Since I had never done this before, I chose JavaScript because it is a language I use every day.

Following the few tips I could find on Google, I use request with cheerio to easily access the elements of the page's DOM. I found and retrieved all the necessary information; the only missing step is to recover the link to the next page. That link is generated 4 seconds after the page loads and contains a hash, so this step is unavoidable.

What I would like to do is get the DOM of the page 4-5 seconds after it has loaded, so that I can recover the link.

I searched the internet, and much of the advice is to use PhantomJS for this, but after many attempts I cannot get it to work with Node.

This is my code:

#!/usr/bin/env node
require('babel-register');
import request from 'request'
import cheerio from 'cheerio'
import phantom from 'node-phantom'

phantom.create(function(err,ph) {

  return ph.createPage(function(err,page) {

    return page.open(url, function(err,status) {

      console.log("opened site? ", status);
      page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function(err) {

        //jQuery Loaded.
        //Wait for a bit for AJAX content to load on the page. Here, we are waiting 5 seconds.

        setTimeout(function() {

          return page.evaluate(function() {

            var tt = cheerio.load($this.html())
            console.log(tt)

          }, function(err,result) {

            console.log(result);
            ph.exit();

          });

        }, 5000);

      });
    });
  });
});

But I get this error:

return ph.createPage(function (page) { ^

TypeError: ph.createPage is not a function

Is my approach the best way to do what I want? If not, what is the simplest way? If so, where does my error come from?

Answer by Hakier (best answer)

If you don't have to use PhantomJS, you can use Nightmare instead.

It is a pretty neat library for problems like yours. It uses Electron as its web browser, and you can run it with or without showing a window (you can also open developer tools, like in Google Chrome).

Its only flaw is that, if you want to run it on a server without a graphical interface, you must install at least a virtual framebuffer (for example Xvfb).

Nightmare has methods like wait(cssSelector), which waits until a matching element appears on the page.
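To give a rough idea of why waiting on a selector tends to be more robust than sleeping for a fixed 5 seconds, here is a small dependency-free sketch of the general idea: poll a condition until it holds or a timeout expires. The helper name waitFor and its timeoutMs / intervalMs options are hypothetical and chosen only for this illustration; this is not Nightmare's actual implementation.

```javascript
// Hypothetical helper, not part of Nightmare: poll `check` until it returns
// a truthy value, or reject once `timeoutMs` has elapsed.
// The default values are illustrative assumptions.
function waitFor(check, { timeoutMs = 30000, intervalMs = 250 } = {}) {
    return new Promise((resolve, reject) => {
        const deadline = Date.now() + timeoutMs;

        function poll() {
            let value;
            try {
                value = check();
            } catch (error) {
                reject(error); // a throwing check aborts the wait
                return;
            }
            if (value) {
                resolve(value); // condition met: hand back what was found
            } else if (Date.now() >= deadline) {
                reject(new Error(`Condition not met within ${timeoutMs} ms`));
            } else {
                setTimeout(poll, intervalMs); // try again a little later
            }
        }

        poll();
    });
}
```

In a browser context this could be used as waitFor(() => document.querySelector(selector)), which resolves with the element as soon as it exists instead of always paying the full delay.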

Your code would be something like:

const Nightmare = require('nightmare');

const nightmare = Nightmare({
    show: true,        // show the browser window
    openDevTools: true // open dev tools in the browser window
});

const url = 'http://hakier.pl';
const selector = '#someElementSelectorWhichWillAppearAfterSomeDelay';

nightmare
    .goto(url)
    .wait(selector) // wait until the element appears on the page
    // The evaluate callback runs in the browser scope, not in Node.js,
    // so it has no access to Node.js variables. Anything it needs
    // (here: selector) must be passed in as an extra argument.
    .evaluate(selector => {
        return {
            nextPage: document.querySelector(selector).getAttribute('href')
        };
    }, selector)
    .end() // close the Electron process when done
    .then(extracted => {
        console.log(extracted.nextPage); // your extracted data from evaluate
    })
    .catch(error => {
        console.error('Extraction failed:', error);
    });
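If you would rather wait a fixed 4-5 seconds and keep parsing with cheerio, as described in the question, a rough sketch could look like the following. It assumes the nightmare and cheerio packages are installed. The function names getNextPageUrl and toAbsoluteUrl, the 5000 ms default and the link selector passed in are all hypothetical and would need adapting to the real page.

```javascript
// Hypothetical helper: turn an href found in the page (possibly relative)
// into an absolute URL, using Node's built-in URL class.
function toAbsoluteUrl(pageUrl, href) {
    return new URL(href, pageUrl).href;
}

// Hypothetical helper: load the page, wait a fixed delay, then hand the
// rendered HTML to cheerio and read the next-page link from it.
async function getNextPageUrl(pageUrl, linkSelector, delayMs = 5000) {
    // Required on first use; assumes both packages are installed.
    const Nightmare = require('nightmare');
    const cheerio = require('cheerio');

    const html = await Nightmare({ show: false })
        .goto(pageUrl)
        .wait(delayMs) // fixed delay in milliseconds
        .evaluate(() => document.documentElement.outerHTML)
        .end();

    const $ = cheerio.load(html);
    const href = $(linkSelector).attr('href');
    return href ? toAbsoluteUrl(pageUrl, href) : null;
}
```

It could then be called as getNextPageUrl(url, 'a.next').then(console.log), where 'a.next' stands in for whatever selector matches the generated link. A fixed delay is simpler to reason about, but waiting on the selector as in the example above is usually less fragile if the page is sometimes slower than expected.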

Happy hacking!