How do you access the unaltered source of a page via PhantomJS?

Using phantomjs, it's possible to get access to a copy of the modified DOM, post-parsing. Using a cURL call you can get access to the page pre-parsing. In the pre-parsed code, you may find errors which are corrected by a browser.

How do you get access to both the post-rendered changes and the pre-rendered content to make a comparison of the fixes the browser does automatically?

Is the best method to run diff on the two files, or does PhantomJS hold two copies of the content, the original and the modified form? I can't find the right way to phrase this to get an answer via Google, and a search here: https://stackoverflow.com/search?q=[phantomjs]+save+unaltered+source didn't turn up any results.

I'd like to avoid a second call to the same page for bandwidth/efficiency reasons.

Artjom B. (best answer)

There is no way to directly access the unaltered source (referred to as view-source in other browsers) in PhantomJS.

You could try to read the page from the PhantomJS cache (when run with the --disk-cache=true option), but there is an easier method: send an AJAX request to get the source "on the wire". You would then need to handle redirects yourself.

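var page = require('webpage').create(),
    fs = require('fs');

// Fetch the raw response body with an XHR issued from inside the page context.
function get(page, url) {
    return page.evaluate(function(url){
        var xhr = new XMLHttpRequest();
        xhr.open('GET', url, false); // false = synchronous, so the body can be returned directly
        xhr.send(null);
        return xhr.responseText;
    }, url);
}

var url = 'http://example.com';

page.open(url, function(status){
    if (status !== 'success') {
        console.log('Failed to load ' + url);
        phantom.exit(1);
        return;
    }
    var co = get(page, url);
    fs.write("original.html", co, 'w');           // source as sent by the server
    fs.write("rendered.html", page.content, 'w'); // DOM serialized after parsing
    phantom.exit();
});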

Even with this simple script you can see that the two files differ, although the page involves no JavaScript.
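
To see exactly which lines the parser changed, the two strings can be compared before they are written out. The following is only a minimal sketch of a line-based diff built on a longest-common-subsequence table. The name diffLines is hypothetical and not part of PhantomJS, and the table needs memory proportional to the product of the two line counts, so for large pages running an external diff tool on the saved files is probably the better choice.

```javascript
// Hypothetical helper, not part of PhantomJS: a line-based diff using a
// longest-common-subsequence (LCS) table. Returns an array of lines prefixed
// with "  " (unchanged), "- " (only in the original) or "+ " (only in the
// rendered version).
function diffLines(originalText, renderedText) {
    var a = originalText.split('\n');
    var b = renderedText.split('\n');
    var i, j;

    // lcs[i][j] = length of the LCS of a[i..] and b[j..]
    var lcs = [];
    for (i = 0; i <= a.length; i++) {
        lcs[i] = [];
        for (j = 0; j <= b.length; j++) {
            lcs[i][j] = 0;
        }
    }
    for (i = a.length - 1; i >= 0; i--) {
        for (j = b.length - 1; j >= 0; j--) {
            lcs[i][j] = (a[i] === b[j])
                ? lcs[i + 1][j + 1] + 1
                : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
        }
    }

    // Walk the table from the start of both inputs and emit the diff.
    var out = [];
    i = 0;
    j = 0;
    while (i < a.length && j < b.length) {
        if (a[i] === b[j]) {
            out.push('  ' + a[i]);
            i++;
            j++;
        } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
            out.push('- ' + a[i]);
            i++;
        } else {
            out.push('+ ' + b[j]);
            j++;
        }
    }
    while (i < a.length) { out.push('- ' + a[i++]); }
    while (j < b.length) { out.push('+ ' + b[j++]); }
    return out;
}
```

Inside the page.open callback this could be used as fs.write("changes.diff", diffLines(co, page.content).join('\n'), 'w'). Because page.content is a re-serialization of the DOM, whitespace and line breaks may differ from the original, so a line-based comparison can be noisy.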

You might need to run PhantomJS with the --web-security=false option, for example if the request would otherwise be blocked as cross-origin. Instead of passing the url into the get() function, you may directly access page.url:

function get(page, url) {
    // Fall back to the URL of the currently loaded page when none is given.
    url = url || page.url;
    return page.evaluate(function(url){
        var xhr = new XMLHttpRequest();
        xhr.open('GET', url, false); // synchronous request
        xhr.send(null);
        return xhr.responseText;
    }, url);
}