How can I push specific links found on each page in a separate list of links into an array?

GENERAL IDEA


Here is a snippet of what I'm working with:

my $url_temp;
my $page_temp;
my $p_temp;
my @temp_stuff;
my @collector;

foreach (@blarg_links) {
        $url_temp = $_;
        $page_temp = get( $url_temp ) or die $!;
        $p_temp = HTML::TreeBuilder->new_from_content( $page_temp );
        @temp_stuff = $p_temp->look_down(
                _tag => 'foo',
                class => 'bar'
        );
        foreach (@temp_stuff) {
                push(@collector, "http://www.foobar.sx" . $1) if $_->as_HTML =~ m/href="(.*?)"/;
        };
};

Hopefully it is clear what I'm trying to do: visit each link in @blarg_links, find the link endings on that page, and push them into an array called @collector. Each page in @blarg_links has one or more foo tags with class bar, and running as_HTML on each of those gives markup whose href value I want to capture. The captured values, prefixed with the site root, form the list of links that hold the data I'm really after. Does that make sense?


ACTUAL DATA


my $url2 = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';
my $page2 = get( $url2 ) or die $!;
my $p2 = HTML::TreeBuilder->new_from_content( $page2 );

my @stuff2 = $p2->look_down(
        _tag => 'div',
        class => 'year mini-day-on'
);

my @chem_links;

foreach (@stuff2) {
        push(@chem_links, $1) if $_->as_HTML =~ m/(http:\/\/www\.chemistry\.ucla\.edu\/calendar-node-field-date\/day\/[0-9]{4}-[0-9]{2}-[0-9]{2})/;
};

my $url_temp;
my $page_temp;
my $p_temp;
my @temp_stuff;
my @collector;

foreach (@chem_links) {
        $url_temp = $_;
        $page_temp = get( $url_temp ) or die $!;
        $p_temp = HTML::TreeBuilder->new_from_content( $page_temp );
        @temp_stuff = $p_temp->look_down(
                _tag => 'span',
                class => 'field-content'
        );
};

foreach (@temp_stuff) {
                push(@collector, "http://www.chemistry.ucla.edu" . $1) if $_->as_HTML =~ m/href="(.*?)"/;
};

n.b. - I want to use HTML::TreeBuilder. I'm aware of alternatives.


There are 2 answers

Borodin On

This is a rough attempt at what I think you want.

It fetches all the links on the first page and visits each of them in turn, printing the link in each <span class="field-content"> element.

use strict;
use warnings;
use 5.010;

use IO::Handle;          # provides STDOUT->autoflush on Perl older than 5.14
use HTML::TreeBuilder;   # new_from_url also requires LWP::UserAgent

STDOUT->autoflush;

my $url = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';
my $tree = HTML::TreeBuilder->new_from_url($url);

# Collect the per-day calendar links from the year overview page
my @chem_links;

for my $div ( $tree->look_down( _tag => 'div', class => qr{\bmini-day-on\b} ) ) {
  my ($anchor) = $div->look_down(_tag => 'a', href => qr{http://www\.chemistry\.ucla\.edu});
  next unless $anchor;   # skip any div without a matching link
  push @chem_links, $anchor->attr('href');
}

# Visit each day page and gather the seminar links it contains
my @collector;

for my $url (@chem_links) {

  say $url;

  my $tree = HTML::TreeBuilder->new_from_url($url);

  my @seminars;

  for my $span ( $tree->look_down( _tag => 'span', class => 'field-content' ) ) {
    my ($anchor) = $span->look_down(_tag => 'a', href => qr{/});
    next unless $anchor;   # skip spans that hold no link
    push @seminars, 'http://www.chemistry.ucla.edu'.$anchor->attr('href');
  }

  say "  $_" for @seminars;
  say '';

  push @collector, @seminars;
}
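
A note on the ACTUAL DATA code in the question: there the loop over @temp_stuff sits after the loop over @chem_links rather than inside it, so @temp_stuff is overwritten on every pass and only the last page's spans are still in it when the links are collected. The following is a minimal sketch of that effect using made-up data instead of real page fetches; %links_on_page, collect_outside and collect_inside are hypothetical names chosen for illustration and do not appear in the original scripts.

```perl
use strict;
use warnings;

# Hypothetical stand-in for fetching and parsing a page: maps each
# day page to the seminar paths found on it (made-up data).
my %links_on_page = (
    'day/2014-01-06' => ['/seminars/a'],
    'day/2014-01-10' => ['/seminars/b', '/seminars/c'],
);
my @day_pages = sort keys %links_on_page;

# Collecting loop placed AFTER the page loop: @found is overwritten on
# every pass, so only the last page's links are ever collected.
sub collect_outside {
    my @found;
    my @collector;
    for my $page (@day_pages) {
        @found = @{ $links_on_page{$page} };
    }
    push @collector, $_ for @found;
    return @collector;
}

# Collecting loop placed INSIDE the page loop: every page contributes.
sub collect_inside {
    my @collector;
    for my $page (@day_pages) {
        my @found = @{ $links_on_page{$page} };
        push @collector, $_ for @found;
    }
    return @collector;
}

print "outside: @{[ collect_outside() ]}\n";
print "inside:  @{[ collect_inside() ]}\n";
```

With this data the first version returns only the two links from the last page, while the second returns all three. Declaring the per-page array with my inside the loop, as in the second version, also makes this kind of mistake harder to reintroduce.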

Miller On

For a more modern framework for parsing webpages, I would suggest you take a look at Mojo::UserAgent and Mojo::DOM. Instead of having to manually march through each section of your HTML tree, you can use the power of CSS selectors to zero in on the specific data that you want. There's a nice 8-minute introductory video on the framework at Mojocast Episode 5.

# Parses the UCLA Chemistry Calendar and displays all seminar links

use strict;
use warnings;

use Mojo::UserAgent;
use URI;

my $url = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';

my $ua = Mojo::UserAgent->new;
my $dom = $ua->get($url)->res->dom;

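# map(attr => 'href') calls attr('href') on every matched element; recent
# Mojolicious releases no longer forward ->attr through the collection
for my $dayhref ($dom->find('div.mini-day-on > a[href*="/day/"]')->map(attr => 'href')->each) {
    my $dayurl = URI->new($dayhref)->abs($url);
    print $dayurl, "\n";

    my $daydom = $ua->get($dayurl->as_string)->res->dom;
    for my $seminarhref ($daydom->find('span.field-content > a[href]')->map(attr => 'href')->each) {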
        my $seminarurl = URI->new($seminarhref)->abs($dayurl);
        print "  $seminarurl\n";
    }

    print "\n";
}

Output is identical to that of Borodin's solution using HTML::TreeBuilder:

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-06
  http://www.chemistry.ucla.edu/seminars/nano-rheology-enzymes

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-09
  http://www.chemistry.ucla.edu/seminars/imaging-approach-biology-disease-through-chemistry

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-10
  http://www.chemistry.ucla.edu/seminars/arginine-methylation-%E2%80%93-substrates-binders-function
  http://www.chemistry.ucla.edu/seminars/special-inorganic-chemistry-seminar

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-13
  http://www.chemistry.ucla.edu/events/robert-l-scott-lecture-0

...