I want to extract all the tables from an HTML file and print their contents in the following way: each cell separated by \t, each row separated by \n, and each table separated by \n\n. The following is my script. When I changed it to call findvalues on tr, the whole tr was returned as one element. I also tried other methods such as findnodes_as_strings($path). How can I modify the script to produce the structure described above?
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new;
$tree->parse_file( "html.html");
my @values=$tree->findvalues(q{//table//tr//td});
print $_, "\n" foreach(@values);
You need to process each table separately, and likewise each row within a table, instead of collecting every td in the document with a single query.
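A minimal sketch of that idea, not necessarily the exact code originally posted: it assumes that element nodes returned by findnodes accept relative XPath expressions through findnodes and findvalues, as HTML::TreeBuilder::XPath documents. The helper name tables_to_text is hypothetical, and the file name html.html is taken from the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: returns the text of every table in the parsed tree,
# cells joined by \t, rows by \n, tables by \n\n.
sub tables_to_text {
    my ($tree) = @_;
    my @tables;
    for my $table ($tree->findnodes('//table')) {
        my @rows;
        # Relative path: only rows below the current table.
        for my $row ($table->findnodes('.//tr')) {
            # In list context findvalues returns one string per matched cell.
            my @cells = $row->findvalues('./td');
            push @rows, join("\t", @cells);
        }
        push @tables, join("\n", @rows);
    }
    return join("\n\n", @tables);
}

my $file = 'html.html';    # file name from the question
if (-e $file) {            # do nothing if the file is absent
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file($file);
    print tables_to_text($tree), "\n";
    $tree->delete;         # free the tree explicitly
}
```

The nested loops keep the table and row boundaries that a single //table//tr//td query throws away, so the separators can be added with join at each level.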
Of course, this is a solution only for simple tables (think about colspans, th cells, tables nested inside tables, etc.).