Scrape HTML files with Perl, returning content only, in order

Question

Asked by sqldoug on 02 September 2015 at 19:34 (315 views)

Using HTML::TreeBuilder (or Mojo::DOM), I'd like to scrape the text content of an HTML file while keeping it in document order, so that I can put the text values into an array (and then replace the text values with variables for templating purposes).

But this, in TreeBuilder:

my $map_r = $tree->tagname_map();

my @contents = map { $_->content_list } $tree->find_by_tag_name(keys %$map_r);

foreach my $c (@contents) {
  say $c;
}

doesn't return the values in document order; of course, hashes aren't ordered. So how can I visit the tree from the root down and keep the sequence of values returned? Recursively walk the tree? Essentially, I'd like to use the method 'as_text', but for each element separately. (I followed this nice idea, but I need it for all elements.)
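The recursive walk asked about here might look roughly like the following. This is only a minimal sketch, assuming HTML::TreeBuilder is installed; the sample `$html` string, the `walk` helper and the `@text` array are hypothetical names introduced for illustration. It relies on `content_list` returning an element's children in document order, with text segments as plain strings and child elements as objects.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Hypothetical sample input; any HTML string would do.
my $html = '<html><body><h1>Title</h1><p>First <b>bold</b> tail</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

my @text;    # text segments, collected in document order

# Depth-first walk: children are visited in the order they appear.
sub walk {
    my ($node) = @_;
    for my $child ($node->content_list) {
        if (ref $child) {
            walk($child);                # child element: recurse into it
        }
        else {
            (my $t = $child) =~ s/\s+/ /g;   # collapse runs of whitespace
            $t =~ s/^\s+|\s+$//g;            # trim both ends
            push @text, $t if length $t;     # skip whitespace-only segments
        }
    }
}

walk($tree);
say for @text;

$tree->delete;    # free the tree once done
```

Because every text segment is pushed at the moment it is encountered, the array order follows the document rather than any hash ordering.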

**sqldoug** · Answer 1 · 2015-09-09T20:44:49+00:00

This is better (using Mojo::DOM):

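use Mojo::DOM;

my $dom = Mojo::DOM->new;
my @text;

# Visit every element in document order and collect its own text.
$dom->parse($html)->find('*')->each(
    sub {
        my $text = shift->text;
        $text =~ s/\s+/ /g;    # collapse runs of whitespace
        push @text, $text;
    }
);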

However, any further comments are welcome.
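One possible refinement, offered as a sketch rather than a definitive answer: `text` returns an element's own text as a single string, so in `<p>First <b>bold</b> tail</p>` the pieces "First" and "tail" end up in one entry, listed before "bold". If each text segment should keep its exact position, walking the text nodes via `descendant_nodes` may fit better. The sample `$html` and the `@segments` array below are hypothetical names used only for illustration.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical sample input; any HTML string would do.
my $html = '<div><h1>Title</h1><p>First <b>bold</b> tail</p></div>';

my @segments;    # one entry per text node, in document order

Mojo::DOM->new($html)->descendant_nodes->each(
    sub {
        my $node = shift;
        return unless $node->type eq 'text';    # ignore tags, comments, etc.
        my $t = $node->content;
        $t =~ s/\s+/ /g;                        # collapse runs of whitespace
        $t =~ s/^\s+|\s+$//g;                   # trim both ends
        push @segments, $t if length $t;        # skip whitespace-only nodes
    }
);

say for @segments;
```

For templating, keeping one array entry per text node makes it easier to map each placeholder back to a single spot in the markup.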
