Trying to mirror a site that uses strapdown.js


There is a site using strapdown.js that I am trying to mirror with httrack or wget, but I fall short because the site serves Markdown rather than HTML. Only strapdown.js, running in the browser, converts the Markdown links into HTML links. The client therefore needs to execute the JavaScript first and then look for links in the generated DOM.

Is there a tool on the market that can do this?

I have tried

wget -erobots=off --no-parent --wait=3 --limit-rate=20K -r -p -U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" -A htm,html,css,js,json,gif,jpeg,jpg,bmp http://my.si.te

and

httrack -w -v --extended-parsing=N -n -t -r -p -U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" --robots=0 http://my.si.te "+*" "-r6"

Any help is highly appreciated.

1 Answer

Answered by J Richard Snape:

If you are comfortable writing your client in Java, I have used HtmlUnit for this.

A stripped-down example that fetches a page and runs its JavaScript would look like the following. It's adapted from an actual script I use to scrape one of the sites I administer, with strapdownjs.com substituted as the example URL. You'll have to ignore the CSS warnings if you run it, but you'll notice it finds and outputs the link to bootswatch.com, which is generated by JavaScript from the Markdown in the page source. You might also want to look at the tool's own Getting Started page.

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;

public class WebGetter
{

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException
    {
        // Set up the client (i.e. a GUI-less browser) with JavaScript enabled
        // (in newer HtmlUnit versions these setters live on webClient.getOptions())
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
        webClient.setThrowExceptionOnScriptError(false);
        webClient.setJavaScriptEnabled(true);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.setJavaScriptTimeout(20000);

        // Get the page you want (stored as the HtmlUnit object HtmlPage)
        String url = "http://strapdownjs.com/";
        HtmlPage page = webClient.getPage(url);

        // Give any background JavaScript up to 20s to finish before
        // inspecting the DOM
        webClient.waitForBackgroundJavaScript(20000);

        // Use some of the HtmlUnit functionality to look at the generated
        // DOM (e.g. here, find all links and print their targets)
        List<HtmlAnchor> allLinks = page.getAnchors();
        for (HtmlAnchor a : allLinks)
        {
            System.out.println(a.getHrefAttribute());
        }

        webClient.closeAllWindows();
    }
}
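Once HtmlUnit has rendered a page, mirroring it also means writing the result to disk under a path derived from its URL (the rendered DOM is available as a string via `page.asXml()`). A minimal sketch of the URL-to-path mapping, using only the plain JDK — the class and method names here are my own invention, not part of HtmlUnit:

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MirrorPaths
{
    // Map a page URL to a relative local file path, e.g.
    // http://my.si.te/docs/page.html -> my.si.te/docs/page.html
    // Directory-style URLs ("/" or no path at all) map to index.html.
    public static Path urlToLocalPath(String url)
    {
        URI uri = URI.create(url);
        String path = uri.getPath();
        if (path == null || path.isEmpty())
        {
            path = "/index.html";
        }
        else if (path.endsWith("/"))
        {
            path = path + "index.html";
        }
        // Strip the leading slash so the path is relative to the mirror root
        if (path.startsWith("/"))
        {
            path = path.substring(1);
        }
        return Paths.get(uri.getHost(), path);
    }
}
```

You would then write `page.asXml()` to `urlToLocalPath(url)` for each page you visit, creating parent directories as needed.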