Reading a huge webpage remotely via HTTP in Perl


I have a huge webpage, about 5 GB in size, and I would like to read its content directly (remotely) without downloading the whole file. I tried opening the HTTP URL with a plain open file handle, but I get the error "No such file or directory". I also tried LWP::Simple, but fetching the whole content with get runs out of memory. Is there a way to open this content remotely and read it line by line? Thank you for your help.


There are 2 answers

Answer from mvp

This Perl code downloads a file from a URL, resuming from where it left off if the file has already been partially downloaded.

It requires that the server return the file size (the Content-Length header) in response to a HEAD request, and that it support byte ranges for the URL in question.

If you want special processing for each chunk, add it at the marked spot below:

use strict;
use warnings;
use LWP::UserAgent;
use IO::Handle;                  # provides $fh->flush
use List::Util qw(min);

my $url  = "http://example.com/huge-file.bin";
my $file = "huge-file.bin";

DownloadUrl($url, $file);

sub DownloadUrl {
    my ($url, $file, $chunksize) = @_;
    $chunksize ||= 1024*1024;
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->head($url);
    my $size = $res->header('Content-Length');
    die "Cannot get size for $url" unless defined $size;
    # Open in append mode so an interrupted download can be resumed.
    open my $fh, '>>', $file or die "ERROR: $!";
    for (;;) {
        $fh->flush;
        my $range1 = -s $fh;                        # resume from what we already have
        my $range2 = min($range1 + $chunksize, $size);
        last if $range1 == $range2;                 # nothing left to fetch
        # HTTP byte ranges are inclusive, hence the -1 on the upper bound.
        $res = $ua->get($url, Range => "bytes=$range1-" . ($range2 - 1));
        last unless $res->is_success();
        # process next chunk:
        print $fh $res->content();
    }
    close $fh;
}
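
As a quick pre-flight check (a minimal sketch of my own, not part of the answer above, with a placeholder URL), you could verify both requirements with a HEAD request before calling DownloadUrl:

use strict;
use warnings;
use LWP::UserAgent;

my $url = "http://example.com/huge-file.bin";
my $ua  = LWP::UserAgent->new;
my $res = $ua->head($url);

die "HEAD request failed: " . $res->status_line unless $res->is_success;
die "Server did not report Content-Length"
    unless defined $res->header('Content-Length');
warn "Server does not advertise 'Accept-Ranges: bytes'; ranged downloads may not work\n"
    unless ($res->header('Accept-Ranges') || '') =~ /bytes/;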
Answer from chipschipschips

You could try using LWP::UserAgent. Its request method lets you pass a CODE reference, so you can process the data as it arrives.

#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent ();
use HTTP::Request ();

my $request = HTTP::Request->new(GET => 'http://www.example.com/');
my $ua = LWP::UserAgent->new();

# The callback is invoked for each chunk of the response body as it arrives,
# so the whole document is never held in memory at once.
$ua->request($request, sub {
    my ($chunk, $res) = @_;
    print $chunk;
    return undef;
});

Technically the function should return the content instead of undef, but returning undef seems to work. According to the documentation:

The "content" function should return the content when called. The content function will be invoked repeatedly until it return an empty string to signal that there is no more content.

I haven't tried this on a large file, and you would need to write your own code to handle the data arriving in arbitrarily sized chunks.
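
As a rough sketch of that (assuming the resource is newline-delimited text; the URL and the process_line helper are placeholders of mine), you could buffer the incoming chunks and split out complete lines as they become available:

use strict;
use warnings;

use LWP::UserAgent ();
use HTTP::Request ();

my $request = HTTP::Request->new(GET => 'http://www.example.com/huge.txt');
my $ua = LWP::UserAgent->new();

my $buffer = '';
$ua->request($request, sub {
    my ($chunk, $res) = @_;
    $buffer .= $chunk;
    # Emit every complete line; keep the trailing partial line in the buffer.
    while ($buffer =~ s/^(.*?\n)//) {
        process_line($1);
    }
    return undef;
});
process_line($buffer) if length $buffer;   # last line may lack a newline

# Hypothetical per-line handler; replace with whatever work you need.
sub process_line {
    my ($line) = @_;
    print $line;
}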