" I would like to resume this download without checking the dates on the server for every fi" /> " I would like to resume this download without checking the dates on the server for every fi" /> " I would like to resume this download without checking the dates on the server for every fi"/>

Resume an aborted recursive download with wget without checking the dates for already downloaded files


The following command was aborted:

wget -w 10 -m -H "<URL>"

I would like to resume this download without checking the dates on the server for every file that I've already downloaded.

I'm using: GNU Wget 1.21.3 built on darwin18.7.0.

The following doesn't work for me. It keeps requesting headers at a rate of 1 every 10 seconds (so as not to overwhelm the server), and while it doesn't re-download the files, the checking is very slow: 10 seconds times 80,000 files is a long time, and if the download aborts again after 300,000 files, resuming with this command will take even longer. In fact it takes as long as starting over, which I'd like to avoid.

wget -c -w 10 -m -H "<URL>"

The following is not recursive: because the first file already exists it is not retrieved, and consequently it is never parsed for URLs, so nothing else is downloaded recursively.

wget -w 10 -r -nc -l inf --no-remove-listing -H "<URL>"

The result of this command is this:

File ‘<URL>’ already there; not retrieving.

The file that's "already there" contains links that should be followed, and if those files are "already there" then they too should not be retrieved. This process should continue until wget encounters files that haven't yet been downloaded.

I need to download 600,000 files without overwhelming the server and have already downloaded 80,000 files. wget should be able to zip through all the downloaded files really fast until it finds a missing file that it needs to download and then rate limit the downloads to 1 every 10 seconds.

I've read through the entire man page and can't find anything that looks like it will work except for what I have already tried. I don't care about the dates on the files, retrieving updated files, or downloading the rest of incomplete files. I only want to download files from the 600,000 that I haven't already downloaded without bogging down the server with unnecessary requests.


1 Answer

Answer by Daweo

The file that's "already there" contains links that should be followed

If said file contains absolute links, then you might try using a combination of --force-html and -i file.html. Consider the following simple example; let file.html contain:

<html>
<body>
<a href="http://www.example.com">Example</a>
<a href="http://www.duckduckgo.com">Search</a>
<a href="http://archive.org">Archive</a>
</body>
</html>

then

wget --force-html -i file.html -nc -r -l 1

creates the following structure:

file.html
www.example.com/index.html
www.duckduckgo.com/index.html
archive.org/index.html
archive.org/robots.txt
archive.org/index.html?noscript=true
archive.org/offshoot_assets/index.34c417fd1d63.css
archive.org/offshoot_assets/favicon.ico
archive.org/offshoot_assets/js/webpack-runtime.e618bedb4b40026e6d03.js
archive.org/offshoot_assets/js/index.60b02a82057240d1b68d.js
archive.org/offshoot_assets/vendor/[email protected]/polyfill-support.js
archive.org/offshoot_assets/vendor/@webcomponents/[email protected]/webcomponents-loader.js

and if you remove one of the files, say archive.org/offshoot_assets/favicon.ico, then a subsequent run will download only that missing file.
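
For instance, a minimal sketch of that re-run (assuming the same file.html and directory layout produced above):

rm archive.org/offshoot_assets/favicon.ico

wget --force-html -i file.html -nc -r -l 1

The second run should report "File ‘…’ already there; not retrieving." for everything that still exists locally and fetch only the removed favicon.ico. Your original options (-w 10, -H, -l inf, --no-remove-listing) could presumably be appended to the same command, though I haven't tested that combination.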