HTML Treebuilder XPath to Extract Links

Question

Asked by Neon Flash on 31 July 2012 at 12:55

I am writing a basic script that extracts all the links from a web page. It is written in Perl and uses the WWW::Mechanize and HTML::TreeBuilder::XPath modules, both of which I have installed through CPAN.

I know this can easily be done with WWW::Mechanize alone; however, I would like to learn how to do it using XPath as well.

So, the script will parse the entire web page, check the href attribute of every anchor tag, extract the link, and print it to the console or write it to a file. Please note that the script below does not use strict, since I am only writing it to understand the concept of using XPath to traverse the HTML tree.

Here is the script:

#! /usr/bin/perl

use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
use warnings;

$url="https://example.com";

$mech=WWW::Mechanize->new();
$mech->get($url);

$tree=HTML::TreeBuilder::XPath->new();

$tree->parse($mech->content);

$nodes=$tree->findnodes(q{'//a'}); # line is modified later.

foreach $node($nodes)
{
    print $node->attr('href');
}

And it gives an error:

Can't locate object method "attr" via package "XML::XPathEngine::Literal" at pagegetter.pl line 23.

I have modified the script as follows:

$nodes=$tree->findnodes(q{'//a/@href'});

while($node=$nodes->shift)
{
  print $node->attr('href');
}

Error:

Can't locate object method "shift" via package "XML::XPathEngine::Literal"

I am not sure how to print the value of the href attribute.

Should $nodes hold the list of all the href attributes? I believe it does not store the values themselves but pointers to them?

I tried searching and reading examples, but I am still not sure how to go about it.

Thanks.

There is 1 answer

**daxim** · Accepted Answer · 2012-07-31T13:07:55+00:00

There are a couple of mistakes. Repairs:

# list context
my @nodes = $tree->findnodes(
    q{//a}       # just a string, not a string containing quotes
);

# iterate over array
for my $node (@nodes) {
    print $node->attr('href'), "\n";
}
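
Putting the repairs together, a self-contained sketch might look like the following. It is not the answerer's code: the inline sample page and the variable names are assumptions made for illustration, and the page is parsed from a string rather than fetched with WWW::Mechanize so that the example runs without network access. It shows two routes to the same result: selecting the `a` elements and reading `href` from each one, and selecting the attributes directly with `//a/@href` and asking for their string values through `findvalues`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page, used instead of a live request.
my $html = <<'HTML';
<html><body>
  <a href="https://example.com/one">One</a>
  <a href="/two">Two</a>
  <a name="anchor-without-href">No link</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;    # signal end of input so the tree is finalized

# Approach 1: select the <a> elements, then read the attribute from each.
# The XPath is the bare string //a, with no quotes inside it.
my @hrefs_from_elements;
for my $node ( $tree->findnodes(q{//a}) ) {
    my $href = $node->attr('href');
    next unless defined $href;    # skip anchors that have no href
    push @hrefs_from_elements, $href;
}

# Approach 2: select the href attributes directly and take their values.
my @hrefs_from_attributes = $tree->findvalues(q{//a/@href});

print "$_\n" for @hrefs_from_elements;

$tree->delete;    # release the memory held by the tree
```

Calling `findnodes` in list context is what makes the `for` loop work. With the quoted form `q{'//a'}`, the XPath engine sees a string literal instead of a path, which is why the original script received an `XML::XPathEngine::Literal` object that has neither `attr` nor `shift`.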
