How to search for text in html-document with Mechanize?


I am using WWW::Mechanize, HTML::TreeBuilder and HTML::Element in my Perl script to navigate through HTML documents.

I want to know how to search for an element that contains a certain string as its text.

Here is an example of an HTML document:

<html>
  <body>
    <ul>
      <li>
       <div class="red">Apple</div>
       <div class="abc">figure = triangle</div>
      </li>
      <li>
       <div class="red">Banana</div>
       <div class="abc">figure = square</div>
      </li>
      <li>
       <div class="green">Lemon</div>
       <div class="abc">figure = circle</div>
      </li>
      <li>
       <div class="blue">Banana</div>
       <div class="abc">figure = line</div>
      </li>
    </ul>
  </body>
</html>

I want to extract the text "square". To get it, I have to search for an element with these properties:

  • tag-name is "div"
  • class is "red"
  • content is text "Banana"

Then I need to get its parent (an <li> element) and, from the parent, the child whose text starts with figure =. That part, and the rest, is easy.

I tried it this way:

use strict;
use warnings;
use utf8;
use Encode;
use WWW::Mechanize;
use HTML::TreeBuilder;
use HTML::Element;

binmode STDOUT, ":utf8";

my $mech = WWW::Mechanize->new();

my $uri = 'http.....'; #URI of an existing html-document

$mech->get($uri);
if (($mech->success()) && ($mech->is_html())) {
    my $resp = $mech->response();
    my $cont = $resp->decoded_content;
    my $root = HTML::TreeBuilder->new_from_content($cont);

    #this works, but returns 2 elements:
    my @twoElements = $root->look_down('_tag' => 'div', 'class' => 'red');

    #this returns an empty list:
    my @empty = $root->look_down('_tag' => 'div', 'class' => 'red', '_content' => 'Banana');

    # do something with @twoElements or @empty   
}

What should I use instead of the last command to get the wanted element?

I am not looking for a workaround (I've found one). What I want is a native function of WWW::Mechanize, HTML::Tree or any other CPAN module.

There is 1 answer

Len Jaffe On

Here's pseudocode/untested Perl:

  my @twoElements = $root->look_down('_tag' => 'div', 'class' => 'red');
  foreach my $e ( @twoElements ) {
     # content_list returns a list, not an array reference, so take a list slice
     next unless ($e->content_list)[0] eq 'Banana';
     my $e2 = $e->right;   # get the sibling - might need to try left() depending on ordering
     my ($shape) = ($e2->content_list)[0] =~ /figure = (.+)/;

     # do something with shape...

  }

Not perfect, but it should get you started, and it's general enough to reuse easily. Otherwise, replace

    ($shape) = ($e2->content_list)[0] =~ /figure = (.+)/;

with something like

    $shape = 'square' if ($e2->content_list)[0] =~ /square/;

This might be a little cleaner:

  my @elements = $root->look_down('_tag' => 'div', 'class' => 'red' );
  foreach my $e ( @elements ) {
     next unless $e->as_trimmed_text eq 'Banana';
     my $e2 = $e->right;
     my ($shape) = $e2->as_trimmed_text =~ /figure = (.+)/;

     # do something with shape...
  }
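
As for the "native" part of the question: look_down also accepts a code reference as a criterion, so the text test can go into the call itself instead of a separate loop. The following is only a minimal sketch under a few assumptions. It parses the sample markup from the question out of a string rather than fetching it with WWW::Mechanize, and the variable names ($fruit, $figure, $shape) are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup copied from the question (assumption: the real page has the same shape).
my $html = <<'HTML';
<html>
  <body>
    <ul>
      <li>
       <div class="red">Apple</div>
       <div class="abc">figure = triangle</div>
      </li>
      <li>
       <div class="red">Banana</div>
       <div class="abc">figure = square</div>
      </li>
      <li>
       <div class="green">Lemon</div>
       <div class="abc">figure = circle</div>
      </li>
      <li>
       <div class="blue">Banana</div>
       <div class="abc">figure = line</div>
      </li>
    </ul>
  </body>
</html>
HTML

my $root = HTML::TreeBuilder->new_from_content($html);

# A code reference is a valid look_down criterion: it receives the candidate
# element as $_[0] and must return true for the element to match.
my ($fruit) = $root->look_down(
    _tag  => 'div',
    class => 'red',
    sub { $_[0]->as_trimmed_text eq 'Banana' },
);

my $shape;
if ($fruit) {
    # Search the enclosing <li> for the div whose text starts with "figure =".
    my $figure = $fruit->parent->look_down(
        _tag => 'div',
        sub { $_[0]->as_trimmed_text =~ /^figure\s*=/ },
    );
    ($shape) = $figure->as_trimmed_text =~ /^figure\s*=\s*(.+)$/ if $figure;
}

print defined $shape ? "$shape\n" : "no match\n";

$root->delete;   # free the tree explicitly
```

With the sample document this should print square. Searching from the parent rather than calling right() avoids depending on the order of the two divs inside the <li>.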

WWW::Mechanize::TreeBuilder