How exactly does the "parent" function from HTML::TreeBuilder work?


The documentation on CPAN doesn't really explain this behavior unless I'm missing something. I've put together some quick test code to illustrate my problem:

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TreeBuilder;

my $testHtml = " 
<body>
        <h1>
                <p> 
                        <p>HELLO!
                        </p> 
                </p> 
        </h1>
</body>";

my $parsedPage = HTML::TreeBuilder->new;
$parsedPage->parse($testHtml);
$parsedPage->eof();

my @p = $parsedPage->look_down('_tag' => 'p');

foreach my $p (@p) {
    print $p->parent->tag, " : ", $p->tag, "\t", $p->as_text, "\n";
}

After running the above script, the output is:

body : p
body : p        HELLO! 

Seeing as all the tags are nested one after another, I would think that the parent of the first p tag would be h1, and the parent of the second p tag would be p. Why is the parent function showing the body tag for both?

Answer by Dave Cross (best answer)

Your HTML is invalid: a <p> element cannot contain another <p>, and an <h1> cannot contain a <p>. Given that HTML::TreeBuilder is a subclass of HTML::Parser, I can only assume that the parser is doing what it can to transform your document into valid HTML.

You can call $parsedPage->as_HTML to see what the parser has done to your HTML. It gives me this:

<html><head></head><body><h1></h1><p><p>HELLO! </body></html>
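
To see the effect on parent directly, it may help to list each <p> with its parent under both settings of the parser's implicit_tags option. This is only a rough sketch: p_parents is a hypothetical helper written for this illustration, not part of HTML::TreeBuilder. According to the module's documentation, a false implicit_tags asks the parser to reflect the markup as written rather than deducing implicit elements and end tags, so the second list is expected to differ from the first, though the exact tree may vary between versions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# The same invalid markup as in the question.
my $html = "
<body>
        <h1>
                <p>
                        <p>HELLO!
                        </p>
                </p>
        </h1>
</body>";

# p_parents() is a hypothetical helper, not part of HTML::TreeBuilder.
# It parses $html and returns one "parent : tag" string per <p> element.
# $implicit is passed to implicit_tags(): true (the default) lets the
# parser insert and close elements on its own, false asks it not to.
sub p_parents {
    my ($html, $implicit) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->implicit_tags($implicit);
    $tree->parse($html);
    $tree->eof;

    my @pairs = map { $_->parent->tag . ' : ' . $_->tag }
                $tree->look_down(_tag => 'p');

    $tree->delete;    # free the tree once we are done with it
    return @pairs;
}

print "implicit_tags on:\n";
print "  $_\n" for p_parents($html, 1);

print "implicit_tags off:\n";
print "  $_\n" for p_parents($html, 0);
```

With implicit_tags on, the list should match the output shown in the question, since that is the default behavior. Turning it off is mostly useful for diagnosis; the documentation warns that the resulting tree is unlikely to be good for much beyond quick and dirty parsing.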

Perhaps you should pass your HTML through a validator or HTML::Tidy before processing it.
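
As a rough illustration of the HTML::Tidy route, something like the following could report the problems and produce a repaired copy of the markup. It assumes HTML::Tidy and the underlying tidy library are installed; the name 'test.html' is a made-up label that only appears in the diagnostic messages, and the exact messages and cleaned output depend on the tidy version.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Tidy;

# The markup from the question, compacted onto one line.
my $html = '<body><h1><p><p>HELLO!</p></p></h1></body>';

my $tidy = HTML::Tidy->new;

# Collect tidy's diagnostics. 'test.html' is an assumed label, not a real file.
$tidy->parse('test.html', $html);
print $_->as_string, "\n" for $tidy->messages;

# clean() returns tidy's repaired version of the markup as a single string.
my $cleaned = $tidy->clean($html);
print $cleaned;
```

The cleaned string can then be handed to HTML::TreeBuilder in place of the original markup, though tidy makes its own decisions about how to repair the nesting.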