HTML::TreeBuilder inside a loop


I'm trying to delete all table elements from several HTML files.

The following code runs perfectly on a single file, but when I try to run it over every file in a directory it dies with this error:

Can't call method "look_down" on an undefined value

How can I fix this?

Here is the code:

use strict;
use warnings;

use Path::Class;
use HTML::TreeBuilder;

opendir( DH, "C:/myfiles" );
my @files = readdir(DH);
closedir(DH);

foreach my $file ( @files ) {

    print("Analyzing file $file\n");

    my $tree = HTML::TreeBuilder->new->parse_file("C:/myfiles/$file");

    foreach my $e ( $tree->look_down( _tag => "table" ) ) {
        $e->delete();
    }

    use HTML::FormatText;
    my $formatter = HTML::FormatText->new;
    my $parsed    = $formatter->format($tree);

    print $parsed;
}

There is 1 answer

Answered by Borodin

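The problem is that you're feeding HTML::TreeBuilder all sorts of junk in addition to the HTML files that you intend. As well as the files in the opened directory, readdir returns the names of all subdirectories and the pseudo-directories . and .. (the current and parent directories). When parse_file cannot open the path it is given, it returns undef, so $tree is undefined and the call to look_down fails. You should have seen the unwanted names in the output from your print statement: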

print("Analyzing file $file\n");
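To see what readdir hands back, here is a minimal sketch that uses only core modules. The temporary directory and the names page.html and sub are made up for illustration; they are not from the question.

```perl
use strict;
use warnings;

use File::Temp qw(tempdir);

# Build a throwaway directory holding one file and one subdirectory.
# The names "page.html" and "sub" are hypothetical, chosen for this sketch.
my $dir = tempdir( CLEANUP => 1 );

open( my $fh, '>', "$dir/page.html" ) or die "Cannot create file: $!";
print $fh "<html><body><p>hello</p></body></html>\n";
close $fh;

mkdir "$dir/sub" or die "Cannot create subdirectory: $!";

# readdir returns every entry in the directory, not only plain files.
opendir( my $dh, $dir ) or die "Cannot open $dir: $!";
my @entries = sort readdir $dh;
closedir $dh;

# Keep only the entries that are plain files.
my @plain_files = grep { -f "$dir/$_" } @entries;

print "readdir returned: @entries\n";
print "plain files:      @plain_files\n";
```

The first list should include . and .. and the subdirectory alongside the HTML file, while the second should contain the HTML file only.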

One way to fix this is to check that each value in the loop is a plain file before processing it, something like this:

for my $file ( @files ) {

    # Build the full path and skip anything that isn't a plain file
    # (subdirectories, . and ..)
    my $path = "C:/myfiles/$file";
    next unless -f $path;

    print("Analyzing file $file\n");

    my $tree = HTML::TreeBuilder->new->parse_file($path);

    for my $table ( $tree->look_down( _tag => 'table' ) ) {
        $table->delete();
    }

    ...;
}

But it would be much cleaner to use a call to glob. That way you get only the files that match the pattern, and there is no need to build the full path to each file, because glob returns each name with the directory already attached.

That would look something like this. You will have to adjust the glob pattern if your files don't all end with .html:

for my $path ( glob "C:/myfiles/*.html" ) {

    print("Analyzing file $path\n");

    my $tree = HTML::TreeBuilder->new->parse_file($path);

    for my $table ( $tree->look_down( _tag => 'table' ) ) {
        $table->delete();
    }

    ...;
}

Strictly speaking, a directory name may also match *.html, and if you don't trust your file structure you should also test that each result of glob is a plain file before processing it. But in normal situations, where you know what's in the directory you're processing, that isn't necessary.
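If you do want that extra check, one possible sketch combines glob with a -f filter. The helper name html_files, the temporary directory, and the entries page.html and archive.html are all hypothetical. The sketch also assumes the directory path contains no whitespace, because glob splits its pattern on whitespace.

```perl
use strict;
use warnings;

use File::Temp qw(tempdir);

# Hypothetical helper for this sketch: return the plain files in $dir
# whose names end in .html. Assumes $dir contains no whitespace, because
# glob splits its pattern on whitespace.
sub html_files {
    my ($dir) = @_;
    return grep { -f $_ } glob("$dir/*.html");
}

# Throwaway directory with one real file and one directory whose name
# also matches *.html. Both names are made up for illustration.
my $dir = tempdir( CLEANUP => 1 );

open( my $fh, '>', "$dir/page.html" ) or die "Cannot create file: $!";
print $fh "<html><body><p>hello</p></body></html>\n";
close $fh;

mkdir "$dir/archive.html" or die "Cannot create directory: $!";

my @matched = glob("$dir/*.html");    # the file and the directory both match
my @files   = html_files($dir);       # only the plain file survives the filter

print "glob matched: ", scalar(@matched), "\n";
print "plain files:  ", scalar(@files),   "\n";
```

In the loop from the answer, the same effect comes from adding next unless -f $path; as the first statement of the loop body.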