Scrape HTML files with Perl, returning content only, in order


Using HTML::TreeBuilder (or Mojo::DOM), I'd like to scrape the content of an HTML file while keeping it in document order, so that I can put the text values into an array (and then replace each text value with a variable for templating purposes).

But this, in TreeBuilder:

my $map_r = $tree->tagname_map();

my @contents = map { $_->content_list } $tree->find_by_tag_name(keys %$map_r);

foreach my $c (@contents) {
  say $c;
}

doesn't preserve the order; of course, hashes aren't ordered. So how can I visit the tree from the root down and keep the returned values in sequence? Recursively walk the tree? Essentially, I'd like something like the 'as_text' method, but applied to each element separately. (I followed this nice idea, but I need it for all elements.)

There is 1 answer

sqldoug On

This is better (using Mojo::DOM):

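use Mojo::DOM;

my $dom = Mojo::DOM->new;
my @text;    # collected text values; $html is assumed to hold the page source

# find('*') matches every element, in document order
$dom->parse($html)->find('*')->each(
    sub {
        my $text = shift->text;    # text directly inside this element
        $text =~ s/\s+/ /g;        # collapse runs of whitespace
        push @text, $text;
    }
);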

However, any further comments are welcome.
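For the HTML::TreeBuilder side of the question, a recursive walk along the lines the asker suggests might look like the sketch below. It relies on content_list returning an element's children (text strings and child elements) in document order. The helper name text_in_order and the sample $html are made up for illustration, and the whitespace trimming is just one possible choice.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Sample input, purely illustrative
my $html = '<html><body><h1>Title</h1><p>Some <b>bold</b> text</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Hypothetical helper: depth-first walk that returns the text pieces
# in the order they appear in the document.
sub text_in_order {
    my ($node) = @_;
    # content_list yields plain strings for text and objects for child elements
    return map { ref $_ ? text_in_order($_) : $_ } $node->content_list;
}

my @text;
for my $piece ( text_in_order($tree) ) {
    $piece =~ s/\s+/ /g;       # collapse runs of whitespace
    $piece =~ s/^ | $//g;      # trim one leading/trailing space
    push @text, $piece if length $piece;
}

say for @text;
$tree->delete;                 # free the tree explicitly
```

For the sample input this should print Title, Some, bold and text, one per line, in that order. Unlike find('*') with text, this keeps each text fragment separate, so "Some" and "text" end up as two entries around "bold", which may or may not be what the templating step needs.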