Web crawler text formatting

I have the following code to access an HTML table:

my $table = $tree->look_down(_tag => "table", id => "moduleDetail");

However, the text comes back unformatted, because the web page relies on the table's cell borders to separate pieces of text. I get something like "mathematics for computingJordanstown", with "Jordanstown" being, I assume, in the next cell. Here is the code I am using:

my @array;
my $tree  = HTML::TreeBuilder->new_from_content($mech->content);
my $table = $tree->look_down(_tag => "table", id => "moduleDetail");

for ($table->look_down(_tag => 'tr')) {
    push(@array, $_->as_text());
}

foreach (@array) {
    print $_, " ";
}
$tree->delete();

Note: I tried to separate the text using an array, but had no luck. Any pointers? Thanks.

There are 2 answers

9
Brant Olsen (best answer)

Using HTML::TreeBuilder::XPath

I suggest using the Perl module HTML::TreeBuilder::XPath for this. It should give you exactly what you want.

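From the documentation, I believe your code would look like this using the XPath module:

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content($mech->content);
# findvalues returns one value per matching cell; findnodes_as_string would return a single concatenated string.
# '//td' also matches cells nested under a <tbody>, which '/tr/td' would miss.
my @trArray = $tree->findvalues('//table[@id="moduleDetail"]//td');
$tree->delete();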

For more information on XPath see http://www.w3schools.com/xpath/.
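
If you want to keep the cells grouped by row, something along these lines may work. This is only a sketch: the inline HTML is made-up sample data standing in for $mech->content, and the @lines array and the tab separator are my own choices for illustration, not anything from the original page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup standing in for $mech->content.
my $html = <<'HTML';
<table id="moduleDetail">
  <tr><td>mathematics for computing</td><td>Jordanstown</td></tr>
  <tr><td>COM101</td><td>Semester 1</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my @lines;
# Visit each row of the table, then query its cells relative to that row.
for my $row ($tree->findnodes('//table[@id="moduleDetail"]//tr')) {
    my @cells = $row->findvalues('./td');
    # Joining stringifies each cell value and puts a tab between cells.
    push @lines, join("\t", @cells);
}
$tree->delete();

print "$_\n" for @lines;
```

Each element of @lines then holds one row, with the cells separated by a tab rather than run together.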

Using HTML::TreeBuilder

If you want to stick with HTML::TreeBuilder, then you will need to do the following:

my @array;
my $tree  = HTML::TreeBuilder->new_from_content($mech->content);
my $table = $tree->look_down(_tag => "table", id => "moduleDetail");
# Ask for the individual cells (td) instead of whole rows (tr).
for ($table->look_down(_tag => 'td')) {
  push(@array, $_->as_text());
}
$tree->delete();
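
The reason the original output runs together is that as_text on a tr concatenates the text of all its cells with nothing in between. A minimal sketch of a per-row variant, assuming you want one line per row with a visible separator, could look like this. The inline HTML is invented sample data in place of $mech->content, and the ' | ' separator and the @rows name are just illustrative choices.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup standing in for $mech->content.
my $html = <<'HTML';
<table id="moduleDetail">
  <tr><td>mathematics for computing</td><td>Jordanstown</td></tr>
  <tr><td>COM101</td><td>Semester 1</td></tr>
</table>
HTML

my $tree  = HTML::TreeBuilder->new_from_content($html);
my $table = $tree->look_down(_tag => 'table', id => 'moduleDetail');

my @rows;
for my $tr ($table->look_down(_tag => 'tr')) {
    # Take the text of each cell separately, then join with a separator.
    my @cells = map { $_->as_text } $tr->look_down(_tag => 'td');
    push @rows, join(' | ', @cells);
}
$tree->delete();

print "$_\n" for @rows;
```

With the sample markup above, the first row comes out as "mathematics for computing | Jordanstown" instead of "mathematics for computingJordanstown".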

0
Borodin

Accessing text nodes of the HTML tree is made much easier if you call the objectify_text method on the tree. This changes the text nodes from simple strings to instances of HTML::Element with a pseudo tag name of ~text and an attribute called text equal to the text string. This allows the look_down method to search for text nodes.

If you recode like this you will get the value of each separate text node pushed onto the array.

my $tree = HTML::TreeBuilder->new_from_content($mech->content);  
$tree->objectify_text;

my $table = $tree->look_down(_tag => "table", id => "moduleDetail");

my @text; 

# Each match is a ~text pseudo-element, not a table row.
for my $node ($table->look_down(_tag => '~text')) {
  my $text = $node->attr('text');
  # Skip whitespace-only text nodes.
  push @text, $text if $text =~ /\S/;
}

print "$_\n" for @text;