Downloading XML results with LWP::UserAgent in Perl


I'm hoping for some assistance with a Perl issue.

I need to download an XML file that is the result of a query, parse the results, grab the next link from the XML file, download & repeat.

I have been able to download and parse the first result set fine.

I grab the next URL, but the returned result never seems to change. That is, the second time through the loop, $res->content is the same as the first time, so the value of $url never changes after the first download.

I suspect it's a scope problem, but I just can't seem to get a handle on it.

use LWP::UserAgent;
use HTTP::Cookies;
use Data::Dumper;
use XML::LibXML;
use strict;

my $url = "http://quod.lib.umich.edu/cgi/f/findaid/findaid-idx?c=bhlead&cc=bhlead&type=simple&rgn=Entire+Finding+Aid&q1=civil+war&Submit=Search;debug=xml";

while ($url ne ""){

    my $ua = LWP::UserAgent->new();    
    $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
    $ua->timeout(30);
    $ua->default_header('pragma' => "no-cache", 'max-age' => '0');

    print "Download URL:\n$url\n\n";

    my $res = $ua->get($url);

    if ($res->is_error) {
        print STDERR __LINE__, " Error: ", $res->status_line, " ", $res;
        exit;
    } 

    my $parser = XML::LibXML->new(); 
    my $doc = $parser->load_xml(string=>$res->content);

    #grab the url of the next result set
    $url = $doc->findvalue('//ResultsLinks/SliceNavigationLinks/NextHitsLink');

    print "NEXT URL:\n$url\n\n";

}

There are 2 answers

Ian Tegebo

I suspect the doc you're getting isn't what you expect. It looks like you're fetching some kind of search page and then trying to crawl the resulting pages. Make sure JavaScript isn't responsible for your fetch not returning the content you expect, as in this other question.

Also, you might try dumping the headers to see if you can find another clue:

use Data::Dumper;
print Dumper($res->headers), "\n";

As an aside, you should get into the habit of adding "use warnings" if you haven't already.
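
For instance, alongside the use strict the script already has:

    use strict;
    use warnings;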

Dodger

The server may be giving you only default results without an HTTP_REFERER. I've seen some setups do this deliberately to discourage scraping.

Try this:

Before the while loop, add in:

my $referer;

Right before you have:

    #grab the url of the next result set

Add in:

    $referer = $url;

That way you save the previous URL before resetting it to the next one.

Then add it to your UserAgent header settings:

    $ua->default_header('pragma' => "no-cache", 'max-age' => '0', 'Referer' => $referer);
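
Putting those pieces together, a minimal sketch of the loop with the Referer header added might look like this (same URL and XPath as in the question; untested against that server, so treat it as a starting point rather than a confirmed fix):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use XML::LibXML;

    my $url = "http://quod.lib.umich.edu/cgi/f/findaid/findaid-idx?c=bhlead&cc=bhlead&type=simple&rgn=Entire+Finding+Aid&q1=civil+war&Submit=Search;debug=xml";
    my $referer = "";

    while ($url ne "") {
        my $ua = LWP::UserAgent->new();
        $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
        $ua->timeout(30);
        # send the previously fetched page as the Referer on every request after the first
        $ua->default_header('pragma' => 'no-cache', 'max-age' => '0', 'Referer' => $referer);

        my $res = $ua->get($url);
        die "Error: ", $res->status_line, "\n" if $res->is_error;

        my $doc = XML::LibXML->new()->load_xml(string => $res->content);

        # remember this page before moving on to the next slice of results
        $referer = $url;
        $url     = $doc->findvalue('//ResultsLinks/SliceNavigationLinks/NextHitsLink');
    }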

I won't say for sure that this is the problem, but in my experience that's where I'd start. Another option is to try it outside of LWP: log all of your URLs to a file and try wget-ting them or lynx --source-ing them from the command line to see whether you get different results than LWP gives you. If not, it's certainly something the server is doing, and the trick is to find a way to work around it. The way to do that is to more closely duplicate what a regular web browser does, so comparing the headers you send with the headers sent by Firebug in Firefox or the Inspector in Safari can help a lot.