Trying to mirror a site that uses strapdown.js


There is a site using strapdown.js that I am trying to mirror with httrack or wget, but I fall short because the site serves Markdown rather than HTML. Only strapdown.js, running in the browser, converts the Markdown links into HTML links. The client therefore needs to execute the JavaScript first and then look for links in the generated DOM.

Is there a tool on the market that can do this?

I have tried

wget -erobots=off --no-parent --wait=3 --limit-rate=20K -r -p -U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" -A htm,html,css,js,json,gif,jpeg,jpg,bmp http://my.si.te

and

httrack -w -v --extended-parsing=N -n -t -r -p -U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" --robots=0 http://my.si.te "+*" "-r6"

Any help is highly appreciated.

1 Answer

Answered by J Richard Snape:

If you are comfortable writing your client in Java, I have used HtmlUnit for this.

A stripped-down example that fetches a page and runs its JavaScript would look like the following. It's adapted from an actual script I use to scrape one of the sites I administer, with strapdownjs.com substituted as the example URL. You'll have to ignore the CSS warnings if you run it, but you'll notice it finds and outputs the link to bootswatch.com, which is generated by JavaScript from the Markdown in the page source. You might also want to look at the tool's own Getting Started page.

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;

public class WebGetter
{

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException
    {
        // Set up the client (i.e. a GUI-less browser) with JavaScript enabled
        // (in newer HtmlUnit versions these setters live on webClient.getOptions())
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
        webClient.setThrowExceptionOnScriptError(false);
        webClient.setJavaScriptEnabled(true);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.setJavaScriptTimeout(20000);

        // Get the page you want (stored as the HtmlUnit object HtmlPage)
        String url = "http://strapdownjs.com/";
        HtmlPage page = webClient.getPage(url);

        // Give any background JavaScript up to 20s to finish before
        // inspecting the DOM
        webClient.waitForBackgroundJavaScript(20000);

        // Use some of the HtmlUnit functionality to look at the generated
        // DOM (e.g. here, find all links and print their targets)
        List<HtmlAnchor> allLinks = page.getAnchors();
        for (HtmlAnchor a : allLinks)
        {
            System.out.println(a.getHrefAttribute());
        }

        webClient.closeAllWindows();
    }
}
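Once HtmlUnit has rendered a page, mirroring it also means writing the result to disk under a path derived from its URL (the rendered DOM is available as a string via `page.asXml()`). A minimal sketch of the URL-to-path mapping, using only the plain JDK — the class and method names here are my own invention, not part of HtmlUnit:

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MirrorPaths
{
    // Map a page URL to a relative local file path, e.g.
    // http://my.si.te/docs/page.html -> my.si.te/docs/page.html
    // Directory-style URLs ("/" or no path at all) map to index.html.
    public static Path urlToLocalPath(String url)
    {
        URI uri = URI.create(url);
        String path = uri.getPath();
        if (path == null || path.isEmpty())
        {
            path = "/index.html";
        }
        else if (path.endsWith("/"))
        {
            path = path + "index.html";
        }
        // Strip the leading slash so the path is relative to the mirror root
        if (path.startsWith("/"))
        {
            path = path.substring(1);
        }
        return Paths.get(uri.getHost(), path);
    }
}
```

You would then write `page.asXml()` to `urlToLocalPath(url)` for each page you visit, creating parent directories as needed.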