XPDF pdftotext and page number handling

Question

Asked by chrisrth on 09 October 2012 at 14:16 (1.8k views)

I am using Perl to call pdftotext to extract text from a PDF, and that works great. My issue is that the PDFs I am reading are multi-page, and I am looking for data on specific lines at the top of each page. The following code dumps the entire contents of both pages into a single stream. Because the length of the data after the constant data (at the top of the page) varies, I can't accurately pull my data from page 2. How would I step through each page, either using pdftotext or some other utility/module first, and then call pdftotext on each page individually?

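#!/usr/bin/perl
use strict;
use warnings;

print "Content-type: text/html\n\n";

print "\n<style>
div.line {width:100%;white-space:nowrap;}
div.line div {width:80px;float:left;}
</style>";

# Run pdftotext and read its output (all pages in one stream) line by line.
my $i = 0;
open my $fh, '-|', 'pdftotext', '-layout', 'my_multi_page_pdf.pdf', '-'
    or die "Cannot run pdftotext: $!";

while ( my $line = <$fh> ) {
    $i++;    # running line number across the whole document
    print "\n<div class=\"line\"><div>$i</div>$line</div>";
}
close $fh;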

There is 1 answer

**chrisrth** · Accepted Answer · 2012-10-11T19:32:16+00:00

use strict;
use warnings;

my $i       = 0;
my $pageNum = 1;

# pdftotext marks page breaks with a form feed character (\f, 0x0C),
# so a line containing \f signals the start of a new page.
open my $fh, '-|', 'pdftotext', '-layout', 'multipage.pdf', '-'
    or die "Cannot run pdftotext: $!";
print "---------- Begin Page $pageNum ----------\n";

while ( my $line = <$fh> ) {
    if ( $line =~ /\f/ ) {
        print "\n---------- End Page $pageNum ----------\n";
        $pageNum++;
        print "---------- Begin Page $pageNum ----------\n";
    }

    $i++;    # running line number across all pages
    print "\n<div class=\"line\"><div>$i</div>$line</div>";
}

close $fh;
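
Since the goal in the question is to read specific lines at the top of each page, one possible follow-up is to split the text on form feeds and keep each page's lines in its own array, so that line numbers restart on every page. This is only a minimal sketch under assumptions: `split_pages` is a hypothetical helper name (not part of pdftotext or any module), and the sample string stands in for real `pdftotext -layout` output.

```perl
use strict;
use warnings;

# Hypothetical helper: split text on form feeds (\f) and return one
# array ref of lines per page. A trailing form feed does not produce
# an extra empty page, because split drops trailing empty fields.
sub split_pages {
    my ($text) = @_;
    my @pages;
    for my $chunk ( split /\f/, $text ) {
        push @pages, [ split /\n/, $chunk ];
    }
    return @pages;
}

# Sample input standing in for pdftotext output: two pages, each
# followed by a form feed.
my $text  = "Header A\nvalue 1\nmore\n\fHeader B\nvalue 2\n\f";
my @pages = split_pages($text);

# Print the second line of every page, counted from the top of that page.
for my $p ( 0 .. $#pages ) {
    printf "page %d, line 2: %s\n", $p + 1, $pages[$p][1] // '';
}
```

To try this on a real file, the whole output could be slurped from the pipe with something like `my $text = do { local $/; <$fh> };`, with `$fh` opened as in the answer above. Alternatively, pdftotext accepts `-f` and `-l` to set the first and last page to convert, so it could be called once per page if the page count is known.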
