About Perl reading the webpage online via HTTP

I have a huge webpage, about 5 GB in size, and I would like to read its content directly (remotely) without downloading the whole file. I tried passing the URL to open() as if it were a file, but the error message was "No such file or directory". I also tried LWP::Simple, but it ran out of memory when I used get to fetch the whole content. Is there a way to open this content remotely and read it line by line? Thank you for your help.


There are 2 answers

mvp On

This Perl code downloads a file from a URL in chunks, and resumes where it left off if the file was already partially downloaded.

It requires that the server return the file size (Content-Length) in response to a HEAD request, and that the server support byte ranges for the URL in question.

If you want special processing for each chunk, replace the print statement marked below:

use strict;
use warnings;
use LWP::UserAgent;
use IO::Handle;               # provides the flush method on file handles
use List::Util qw(min);

my $url  = "http://example.com/huge-file.bin";
my $file = "huge-file.bin";

DownloadUrl($url, $file);

sub DownloadUrl {
    my ($url, $file, $chunksize) = @_;
    $chunksize ||= 1024*1024;
    my $ua   = LWP::UserAgent->new;
    my $res  = $ua->head($url);
    my $size = $res->header('Content-Length');
    die "Cannot get size for $url" unless defined $size;
    open my $fh, '>>', $file or die "ERROR: $!";
    binmode $fh;
    for (;;) {
        $fh->flush;
        my $range1 = -s $fh;              # bytes already saved = next offset
        last if $range1 >= $size;
        # The end of an HTTP byte range is inclusive, hence the - 1.
        my $range2 = min($range1 + $chunksize, $size) - 1;
        $res = $ua->get($url, Range => "bytes=$range1-$range2");
        last unless $res->code == 206;    # stop unless the server sent Partial Content
        # process next chunk:
        print {$fh} $res->content;
    }
    close $fh;
}

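
One detail that is easy to get wrong in the loop above is that the end of an HTTP byte range is inclusive. As a rough sketch, the range arithmetic can be checked on its own without any network access. The helper name byte_ranges and its arguments are hypothetical and are not part of LWP:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): split a resource of $size bytes
# into inclusive [first, last] byte offsets of at most $chunksize bytes,
# starting at $offset (the number of bytes already saved locally).
sub byte_ranges {
    my ($size, $chunksize, $offset) = @_;
    $offset ||= 0;
    my @ranges;
    while ($offset < $size) {
        my $last = $offset + $chunksize - 1;
        $last = $size - 1 if $last > $size - 1;   # clamp to the final byte
        push @ranges, [$offset, $last];
        $offset = $last + 1;
    }
    return @ranges;
}

# Example: a 10-byte resource fetched 4 bytes at a time.
print join(", ", map { "bytes=$_->[0]-$_->[1]" } byte_ranges(10, 4)), "\n";
# prints: bytes=0-3, bytes=4-7, bytes=8-9
```

Passing the current size of the local file as the third argument gives the ranges that are still missing, which matches the resume behaviour described in this answer.
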
chipschipschips On

You could try using LWP::UserAgent. The request method allows you to specify a CODE reference, which would let you process the data as it's coming in.

#!/usr/bin/perl -w

use strict;
use warnings;

use LWP::UserAgent ();
use HTTP::Request ();

my $request = HTTP::Request->new(GET => 'http://www.example.com/');
my $ua = LWP::UserAgent->new();

$ua->request($request, sub {
    my ($chunk, $res) = @_;
    print $chunk;
    return undef;
});

It seems to work if you return undef from the callback. The documentation does say the following, but that passage appears to describe a CODE reference supplied as the content of the request being sent, rather than this response callback:

The "content" function should return the content when called. The content function will be invoked repeatedly until it return an empty string to signal that there is no more content.

I haven't tried this on a large file, and you would need to write your own code to handle the data coming in as arbitrarily sized chunks.
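
Since the question asks for line-by-line reading, here is a minimal sketch of how arbitrarily sized chunks could be reassembled into lines. The name make_line_reader is hypothetical, and the chunks are simulated so the example runs without a network connection. It assumes lines end in "\n" and does not strip a trailing "\r":

```perl
use strict;
use warnings;

# Hypothetical helper: returns a chunk handler that buffers partial lines
# and calls $on_line once per complete line (without the newline).
# Call the handler with undef at the end to flush a final unterminated line.
sub make_line_reader {
    my ($on_line) = @_;
    my $buffer = '';
    return sub {
        my ($chunk) = @_;
        if (defined $chunk) {
            $buffer .= $chunk;
            # Emit every complete line currently in the buffer.
            while ($buffer =~ s/\A([^\n]*)\n//) {
                $on_line->($1);
            }
        }
        elsif (length $buffer) {
            $on_line->($buffer);
            $buffer = '';
        }
        return;
    };
}

# Simulated chunks whose boundaries fall in the middle of lines.
my @lines;
my $feed = make_line_reader(sub { push @lines, $_[0] });
$feed->($_) for ("first li", "ne\nsecond line\nthi", "rd line");
$feed->(undef);
print "$_\n" for @lines;
# prints "first line", "second line", "third line", one per line
```

With LWP, the handler would be called from inside the callback shown above, for example $feed->($chunk) in place of print $chunk, followed by $feed->(undef) once the request returns. Only the current partial line is kept in memory, so the size of the page should not matter as long as individual lines are reasonably short.
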