How can I navigate Word tables using the Win32::OLE Perl package?

I have a directory with hundreds of Word docs, each containing a standardized set of tables. I need to parse these tables and extract the data in them. So far I have written a script that prints out each table in its entirety.

#!/usr/bin/perl
use strict;
use warnings;

use Carp qw( croak );
use Cwd qw( abs_path );
use Path::Class;
use Win32::OLE qw(in);
use Win32::OLE::Const 'Microsoft Word';
$Win32::OLE::Warn = 3;
=d
my $datasheet_dir = "./path/to/worddocs";
my @files = glob "$datasheet_dir/*.doc";
print "scalar: ".scalar(@files)."\n";
foreach my $f (@files){
    print $f."\n";
}
=cut
#my $file = $files[0];
my $file = "word.doc";
print "file: $file\n";

# @files is only declared inside the POD block above, so pass the single test file.
run([$file]);

sub run {
    my $argv = shift;
    my $word = get_word();

    $word->{DisplayAlerts} = wdAlertsNone;
    $word->{Visible}       = 1;

    for my $word_file ( @$argv ) {
        print_tables($word, $word_file);
    }

    return;
}

sub print_tables {
    my $word = shift;
    my $word_file = file(abs_path(shift));

    my $doc = $word->{Documents}->Open("$word_file");
    my $tables = $doc->{Tables};

    for my $table (in $tables) {
        my $text = $table->ConvertToText(wdSeparateByTabs)->Text;
        $text =~ s/\r/\n/g;
        print $text, "\n";
    }

    $doc->Close(0);
    return;
}

sub get_word {
    my $word;
    eval { $word = Win32::OLE->GetActiveObject('Word.Application'); 1 }
        or die "$@\n";
    $word and return $word;
    $word = Win32::OLE->new('Word.Application', sub { $_[0]->Quit })
        or die "Oops, cannot start Word: ", Win32::OLE->LastError, "\n";
    return $word;
}

Is there a way to navigate the cells? I want to return only the rows that have a specific value in the first column.

For example, for the following table, I want to grep only the rows that have a fruit in the first column.

apple       pl
banana      xml
California  csv
pickle      txt
Illinois    gov
pear        doc

There is 1 answer

martin clayton On

You could use OLE to access the individual cells of the table, after first getting its dimensions from the table's Rows and Columns collections.
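A minimal sketch of that cell-by-cell approach follows. The helper names rows_matching_first_column and cell_text are made up for illustration, and $table is assumed to be one of the Table objects from the loop in the question. It relies on the Word object model's Rows.Count, Columns.Count and Cell(row, column) members, and it assumes a uniform table: with merged or split cells, Cell() can fail for positions that do not exist.

```perl
use strict;
use warnings;

# Return an array ref of cell texts for every row whose first cell
# is a key in %$wanted. $table is assumed to be a Word Table object
# obtained through Win32::OLE (e.g. one element of "in $doc->{Tables}").
sub rows_matching_first_column {
    my ($table, $wanted) = @_;

    my $n_rows = $table->Rows->Count;
    my $n_cols = $table->Columns->Count;

    my @matches;
    for my $r (1 .. $n_rows) {    # Word indexes rows and columns from 1
        my @cells = map { cell_text($table, $r, $_) } 1 .. $n_cols;
        push @matches, \@cells if exists $wanted->{ $cells[0] };
    }
    return @matches;
}

# The Range text of a cell ends with an end-of-cell marker
# (a carriage return followed by \x07); strip it before comparing.
sub cell_text {
    my ($table, $row, $col) = @_;
    my $text = $table->Cell($row, $col)->Range->Text;
    $text =~ s/[\r\x07]+\z//;
    return $text;
}
```

Inside print_tables this could be called as rows_matching_first_column($table, \%fruit), printing each result with join("\t", @$_). Reading cells this way leaves the table in place, whereas ConvertToText replaces it with plain text in the open document.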

Or you could post-process the text into a Perl array, and iterate that. Instead of

my $text = $table->ConvertToText(wdSeparateByTabs)->Text;
$text =~ s/\r/\n/g;
print $text, "\n";

something like

my %fruit; # population of look-up table of fruit omitted

my $text = $table->ConvertToText(wdSeparateByTabs)->Text;
my @lines = split /\r/, $text;
for my $line ( @lines ) {
    my @fields = split /\t/, $line;

    next unless exists $fruit{$fields[0]};

    print "$line\n";
}

Refinements for case sensitivity, etc., can be added as needed.
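One such refinement, sketched under assumptions: the fruit list below is hypothetical (the question does not give the real one), and fruit_rows is an illustrative name. The look-up keys are lower-cased when the hash is built, and the first field of each line is lower-cased and trimmed before the check, so "Apple" and "banana " both match.

```perl
use strict;
use warnings;

# Hypothetical look-up table; substitute the real list of fruit.
my %fruit = map { ( lc($_) => 1 ) } qw( apple banana pear );

# Take the tab- and CR-delimited text produced by ConvertToText and
# return the lines whose first field names a fruit, ignoring case
# and leading or trailing whitespace in that field.
sub fruit_rows {
    my ($text) = @_;

    my @keep;
    for my $line ( split /\r/, $text ) {
        my @fields = split /\t/, $line;
        next unless @fields;    # skip empty lines

        (my $key = lc $fields[0]) =~ s/^\s+|\s+$//g;
        push @keep, $line if exists $fruit{$key};
    }
    return @keep;
}

# Sample text laid out like the table in the question.
my $sample = join "\r",
    "Apple\tpl", "banana \txml", "California\tcsv", "pickle\ttxt", "pear\tdoc";
print "$_\n" for fruit_rows($sample);
```

The matching lines are returned unmodified, so the original capitalisation and spacing are preserved in the output.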