httrack follow redirects


I am trying to mirror web pages recursively, starting from a URL supplied by the user (with a depth limit set, of course). Wget didn't pick up links from CSS/JS, so I decided to use httrack.

I try to mirror some site like this:

# httrack http://onet.pl -r6 --ext-depth=6 -O ./a "+*"

This website redirects (301) to http://www.onet.pl:80, and httrack just downloads an index.html page containing:

<a HREF="onet.pl/index.html" >Page has moved</a>

and nothing more! When I run:

# httrack http://www.onet.pl -r6 --ext-depth=6 -O ./a "+*"

it does what I want.

Is there any way to make httrack follow redirects? Currently I just prepend "www." to the URL passed to httrack, but that's not a real solution (it doesn't cover all of the users' cases). Are there any better website mirroring tools for Linux?


There are 2 answers

1
neutrinus On BEST ANSWER

On the main httrack forum, one of the developers said that this is not possible.

The proper solution is to use another web mirroring tool.

0
jav974 On

You could use this script to first determine the real target URL, and then run httrack against that URL:

function getCorrectUrl($url) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    curl_setopt($ch, CURLOPT_URL, $url);
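    $out = curl_exec($ch);

    // curl_exec() returns false on failure; fall back to the original URL
    if ($out === false) {
        return $url;
    }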

    // normalize CRLF line endings to LF so the headers can be split on "\n"
    $out = str_replace("\r", "", $out);

    // only look at the headers
    $headers_end = strpos($out, "\n\n");

    if ($headers_end !== false) {
        $out = substr($out, 0, $headers_end);
    }

    $headers = explode("\n", $out);

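    foreach ($headers as $header) {
        // header names are case-insensitive, so match "Location:" in any case
        if (strncasecmp($header, "Location:", 9) === 0) {
            // note: this follows a single hop only, and the value may be a relative URL
            return trim(substr($header, 9));
        }
    }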

    return $url;
}
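
The same idea (resolve the redirect target first, then hand the result to httrack) can also be sketched directly in the shell, without PHP. This is only a rough sketch: it assumes curl is installed, `resolve_url` is a hypothetical helper name, and it relies on curl's `%{url_effective}` write-out variable, which reports the last URL fetched after `-L` has followed any redirects. Some servers answer HEAD requests (`-I`) differently from GET, so the result may not always match what a browser would see.

```shell
#!/bin/sh
# resolve_url: hypothetical helper that prints the final URL after redirects.
# Falls back to the URL it was given if curl fails or prints nothing.
resolve_url() {
    # -s  silent, -I  send HEAD requests only, -L  follow redirects,
    # -o /dev/null  discard the response headers,
    # -w '%{url_effective}'  print the last URL that was actually fetched
    resolved=$(curl -sIL -o /dev/null -w '%{url_effective}' "$1") || resolved=""

    if [ -n "$resolved" ]; then
        printf '%s\n' "$resolved"
    else
        printf '%s\n' "$1"
    fi
}

# Usage (commented out so that sourcing this file does not start a mirror):
#   target=$(resolve_url "http://onet.pl")
#   httrack "$target" -r6 --ext-depth=6 -O ./a "+*"
```

Unlike the PHP function above, this follows the whole redirect chain rather than a single hop, because curl resolves each `Location` header itself. It only handles the starting URL; redirects that httrack meets later during the crawl are not affected.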