Perl HTML::TreeBuilder adding <html>, <head> and <body> tags to parsed content, how to stop or work around it?


Background:
I'm using HTML::TreeBuilder to parse an entire HTML page, say "whole_page" for reference's sake. I'm then using the inherited parse_content method (the same one used for whole_page) of a new TreeBuilder object to parse a chunk of HTML, say "html_to_insert". The root element of html_to_insert should be a <div> tag. Ultimately, the html_to_insert tree needs to be inserted into the whole_page tree.

Problem:
The html_to_insert tree is being wrapped with <html>, <head> and <body> tags, which I obviously don't need. I looked at HTML::Parser to see if there was a parameter that might solve the problem, but I couldn't find anything.

Question:
Is there a simple way to stop the parse method from wrapping html_to_insert with the unneeded tags? Knowing what I'm trying to do, am I doing this ass backwards (is there a better way)?

Thanks for any help.


There are 2 answers

2
Tanktalus

If you can ensure your HTML is XHTML-compliant, that is, it's a proper XML document, you may be able to use XML tools to do the job instead. In the past, I've used XML::Twig for this type of job, and it was a bit easier that way.

Of course, if you're parsing arbitrary web pages from the internet, you may not have this type of guarantee.
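As a rough illustration of the XML route, here is a minimal sketch, assuming both the page and the fragment are well-formed XML. The sample markup and the variable names are made up for the example; XML::Twig::Elt->parse builds a standalone element from a string, and paste attaches it under an existing element.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical stand-in for "whole_page"; must be well-formed XML.
my $whole_page = XML::Twig->new;
$whole_page->parse(
    '<html><head><title>t</title></head><body><p>existing</p></body></html>'
);

# Hypothetical stand-in for "html_to_insert": parsed into a bare element,
# with no wrapper elements added around it.
my $div = XML::Twig::Elt->parse('<div id="inserted"><p>new content</p></div>');

# Attach the fragment as the last child of <body>.
my $body = $whole_page->root->first_child('body');
$div->paste( last_child => $body );

my $xml = $whole_page->sprint;
print $xml, "\n";
```

Because an XML parser never infers missing elements, nothing needs to be stripped afterwards, but the parse will die on markup that is not well-formed.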

0
bvr

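You might want to look at the guts method in HTML::TreeBuilder (part of the HTML::Tree distribution). In list context it returns only the non-implicit nodes, so the implied <html>, <head> and <body> wrappers are left out.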
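A minimal sketch of how that could look, assuming the fragment is a single <div> and that it should go at the end of the page's <body>. The sample markup and variable names are invented for the example. It uses disembowel, which is documented alongside guts and additionally detaches the returned nodes from the implicit wrappers.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for "whole_page".
my $whole_page = HTML::TreeBuilder->new;
$whole_page->parse_content(
    '<html><head><title>t</title></head><body><p>existing</p></body></html>'
);

# Hypothetical stand-in for "html_to_insert". TreeBuilder wraps it in
# implicit <html>, <head> and <body> elements while parsing.
my $fragment = HTML::TreeBuilder->new;
$fragment->parse_content('<div id="inserted"><p>new content</p></div>');

# Keep only the nodes that were really in the input, cut loose from the
# implicit wrappers. The list may also contain plain text strings.
my @nodes = $fragment->disembowel;

# Append the nodes to the <body> of the full page.
my $body = $whole_page->look_down( _tag => 'body' );
$body->push_content(@nodes);

my $html = $whole_page->as_HTML;
print $html, "\n";
```

If the fragment can contain several top-level nodes, they are all returned in order and push_content appends them one after another.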