Mirror multiple page site with lftp


I need to mirror data hosted on a website on a regular basis. I am trying to use lftp (version 4.0.9), as it usually does a great job at this task. However, the site I am downloading from is paginated (I intend to loop over the most recent n pages in a bash script that will run several times a day), and I can't work out how to get lftp to accept the page parameter. I've had no luck searching for a solution online, and what I have tried so far has failed.

This works perfectly:

lftp -c 'mirror -v -i "S1A" -P 4 https://qc.sentinel1.eo.esa.int/aux_resorb/'

This does not:

lftp -c 'mirror -v -i "S1A" -P 4 https://qc.sentinel1.eo.esa.int/aux_resorb/?page=2'

It gives error:

mirror: Access failed: 404 NOT FOUND (/aux_resorb/?page=2)

I also tried passing the new URL in as a variable but that didn't work either. I'd be grateful for suggestions to solve this issue.

Before it is suggested: I know wget is an option and that pagination works with it (I tested it). I don't want to use it because it is less suited to this job: it wastes a lot of time fetching all the "index.html?param=value" files and then removing them, which isn't feasible given the number of pages.

Answer by PKo (best answer)

The problem with lftp's mirror command is that it appends a slash to the given URL when requesting the page (see the tests below). So it comes down to how the remote end handles URLs and whether it objects to the trailing slash. In my tests, Drupal sites, for example, do not like the trailing slash and return a 404, while some other sites worked fine. Unfortunately, I was not able to find a workaround if you insist on using lftp.
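If lftp is not a hard requirement, one possible way around this is to fetch each listing page with curl, pull the file links out of the HTML, and download only the files that are missing locally. The following is a minimal sketch, not a tested solution for this site: it assumes the listing is plain HTML whose file links appear as href="..." attributes containing absolute URLs, and the helper names extract_links and mirror_pages are made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumptions: each listing page is plain HTML, and file links
# appear as href="..." attributes that hold absolute URLs.
BASE_URL='https://qc.sentinel1.eo.esa.int/aux_resorb/'  # URL from the question
PATTERN='S1A'                                            # same filter as -i "S1A"

# Hypothetical helper: read HTML on stdin, print href targets matching $1.
extract_links() {
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//' | grep -- "$1"
}

# Hypothetical helper: walk pages 1..$1 and fetch files not yet present.
mirror_pages() {
    page=1
    while [ "$page" -le "$1" ]; do
        curl -fsS "${BASE_URL}?page=${page}" | extract_links "$PATTERN" |
        while IFS= read -r url; do
            # Skip files that already exist in the current directory.
            [ -e "${url##*/}" ] || curl -fsSO "$url"
        done
        page=$((page + 1))
    done
}
```

Calling `mirror_pages 5` from the target directory would walk the five most recent pages. If the site uses relative links, the base URL would have to be prepended before downloading.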

Tests

I tried the following requests against a web server:

1. lftp -c 'mirror -v http://example/path'
2. lftp -c 'mirror -v http://example/path/?page=2'
3. lftp -c 'mirror -v http://example/path/file'
4. lftp -c 'mirror -v http://example/path/file?page=2'

These commands resulted in the following HEAD requests, as seen by the web server:

1. HEAD /path/
2. HEAD /path/%3Fpage=2/
3. HEAD /path/file/
4. HEAD /path/file%3Fpage=2/

Note that there is always a trailing slash in the request. %3F is just the URL-encoded form of the ? character.
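To make the pattern explicit, the mapping from the four test URLs to the logged request paths can be mimicked in a few lines of shell. This only reproduces the behaviour observed above; it is not how lftp is implemented, and mirror_request_path is a hypothetical helper name.

```shell
#!/bin/sh
# Illustration only: reproduces the request paths logged in the tests.
# mirror_request_path is a hypothetical helper, not part of lftp.
mirror_request_path() {
    # 1st expression: percent-encode every literal '?' as %3F
    # 2nd expression: collapse any trailing slashes into exactly one
    printf '%s\n' "$1" | sed -e 's/?/%3F/g' -e 's:/*$:/:'
}

for p in '/path' '/path/?page=2' '/path/file' '/path/file?page=2'; do
    printf 'HEAD %s\n' "$(mirror_request_path "$p")"
done
```

Running it prints the same four HEAD lines as listed above, which shows why a server that treats "?page=2" as a query string never sees one: the question mark arrives encoded as part of the path.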