Is it possible to use cookies with HTTrack?


I want to download a site from a URL, which is easy enough. But the first page requires a login, which I normally do from a regular browser. HTTrack only downloads that first (login) page, since it can't use my cookies or log in.

Is there any way for me to get around this?


There are 3 answers

4
Kohjah Breese On

Try using cURL in PHP:

http://php.net/manual/en/book.curl.php

There are wrappers for this, like:

http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading

Use options such as:

EDIT: More specific, not tested

Download the class from:

http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading

require_once( 'CURL.php' ); // Change this to whatever the class file from the link above is called
$curl = new CURL();
$curl->retry = 2;
$opts = array(
    CURLOPT_USERAGENT       => 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20091020 Linux Mint/8 (Helena) Firefox/3.5.3',
    // Read and write cookies in the same file so the session persists between requests
    CURLOPT_COOKIEFILE      => 'fb.tmp',
    CURLOPT_COOKIEJAR       => 'fb.tmp',
    CURLOPT_FOLLOWLOCATION  => 1,
    CURLOPT_RETURNTRANSFER  => 1,
    // Note: these two settings turn off SSL certificate checks
    CURLOPT_SSL_VERIFYHOST  => 0,
    CURLOPT_SSL_VERIFYPEER  => 0,
    CURLOPT_TIMEOUT         => 20
);
$post_data = array(); // Put your login POST data here
$opts[CURLOPT_POSTFIELDS] = http_build_query( $post_data );
$curl->addSession( 'https://www.facebook.com/messages', $opts );
$result = $curl->exec();
$curl->clear();
print_r( $result );

Note that sometimes you need to load a page first, so that a cookie gets set, before the site will let you log in.
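For illustration, here is the same "load a page first, then log in" idea as a minimal Python sketch using only the standard library. It is not tested against any real site. The function names, the URLs and the form field names are all hypothetical placeholders, and the real login form may need extra fields (a CSRF token, for example).

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar); the opener stores and resends cookies automatically."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_page_url, login_post_url, form_fields):
    """Visit the login page first, then submit the form with the same cookies."""
    # Step 1: plain GET so the server can set its initial session cookie.
    opener.open(login_page_url, timeout=20).close()
    # Step 2: POST the credentials; the opener sends back the cookie from step 1.
    data = urllib.parse.urlencode(form_fields).encode("ascii")
    with opener.open(login_post_url, data=data, timeout=20) as response:
        return response.read()


# Hypothetical usage (placeholder URLs and field names, not a real site):
# opener, jar = make_session()
# html = login(opener,
#              "https://www.example.com/login",
#              "https://www.example.com/login",
#              {"username": "me", "password": "secret"})
```

After a successful login the cookies sit in `jar`, so further requests made through the same opener stay logged in.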

4
Frank Einstein On

This question was asked in 2013, so I'm not sure whether Httrack supported cookies back then, but it definitely does now.

Instructions:

  1. Log in to your website using Firefox or Chrome, then take a look at the login cookie.
  2. In the root of the folder where you are downloading your website, open the file named cookies.txt; if it's not there, just create it and open it.
  3. Copy the login cookie from your browser to this file.
    (You can also copy every cookie if you don't know which one is the login cookie.)
  • It might not be required, but if you're having problems with your cookies in Httrack, you can try copying the User-Agent from your browser to your Httrack config.
    (I usually use the User-Agent of my browser, just to be safe.)

  • If you don't know how to look at your cookies, it's pretty simple...
    You can use the Developer Tools like so:
    Firefox: F12 -> Storage -> Cookies
    Chrome: F12 -> Application -> Storage -> Cookies

Example of a cookies.txt for Httrack:
(Make sure to use Tabs in your cookies.txt; spaces don't seem to work. StackOverflow automatically converts these Tabs into Spaces.)

www.httrack.com TRUE    /       FALSE   1999999999  foo bar
www.example.com TRUE    /folder FALSE   1999999999  JSESSIONID  xxx1234
www.example.com TRUE    /hello  FALSE   1999999999  JSESSIONID  yyy1234

Reference: http://httrack.kauler.com/help/Cookies
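Since the tabs are easy to lose when copy-pasting, one option is to generate the file with a small script. The sketch below is only an illustration: the helper names are made up, and the column meanings (domain, subdomain flag, path, secure flag, expiry timestamp, name, value) are assumed from the usual Netscape cookies.txt layout that the example lines appear to follow.

```python
def format_cookie_line(domain, path, name, value,
                       include_subdomains=True, secure=False,
                       expires=1999999999):
    """Build one cookies.txt line with real tab characters between fields."""
    fields = [
        domain,
        "TRUE" if include_subdomains else "FALSE",
        path,
        "TRUE" if secure else "FALSE",
        str(expires),  # expiry as a Unix timestamp
        name,
        value,
    ]
    return "\t".join(fields)


def write_cookies_file(file_path, cookies):
    """Write one line per cookie; `cookies` is a list of keyword dicts."""
    with open(file_path, "w", encoding="utf-8", newline="\n") as fh:
        for cookie in cookies:
            fh.write(format_cookie_line(**cookie) + "\n")


# Hypothetical usage, mirroring the second example line above:
# write_cookies_file("cookies.txt", [
#     {"domain": "www.example.com", "path": "/folder",
#      "name": "JSESSIONID", "value": "xxx1234"},
# ])
```

Run it in the root of the download folder (or pass the full path) so Httrack finds the resulting cookies.txt.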

0
tuomassalo On

Adding to Frank Einstein's answer:

You might not need cookies.txt, as httrack also has a --headers option. So, first copy the relevant session cookie from the browser, and then you can use:

httrack --headers 'Cookie: SESSIONID=1234...' ...
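If you have several cookies to pass, the header value can be assembled programmatically. This is a rough sketch rather than a tested recipe: the helper names, the example URL and the cookie values are placeholders, and it only builds and prints the command instead of running it.

```python
import shlex


def build_cookie_header(cookies):
    """Join name/value pairs into a single 'Cookie: a=1; b=2' header line."""
    pairs = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return "Cookie: " + pairs


def build_httrack_command(url, cookies):
    """Return the argument list for httrack with the cookie header attached."""
    return ["httrack", url, "--headers", build_cookie_header(cookies)]


# Placeholder URL and session value; substitute the ones from your browser.
command = build_httrack_command("https://www.example.com/", {"SESSIONID": "1234"})
print(shlex.join(command))  # shell-quoted form you can paste into a terminal
```

To run it directly instead of printing, the list can be passed to `subprocess.run(command)`, which avoids shell quoting issues with the header value.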