Using CAM::PDF for Perl - Can not extract image from pdf

I have a PDF file for which listimages.pl (which uses CAM::PDF) returns nothing, whereas PDF::GetImages does extract an image. With the following code I can find the image object, but I don't know how to write it out to a file. I also can't figure out why the command-line tools don't work.

#!/usr/bin/perl -w
use strict;

use Cwd;
use File::Basename;
use Data::Dumper;
use CAM::PDF;
use CAM::PDF::PageText;
use CAM::PDF::Renderer::Images;

my $file = shift @ARGV or die "Usage: get-pdf-images /path/to/file.pdf\n";

my $pdf = CAM::PDF->new($file) or die "$CAM::PDF::errstr\n";

#print $pdf->toString();

foreach my $p ( 1 .. $pdf->numPages() ) {
    my $page = $pdf->getPageContentTree($p);
    my $str = $pdf->getPageText($p);
    if (defined $str) {
#        CAM::PDF->asciify(\$str);
        print $str;
    }

    print "-------------------------------\n";
    my $gs = $page->findImages();
    my @imageNodes = @{$gs->{images}};
    print "Found " . scalar @imageNodes . " images on page $p\n";
    print Data::Dumper->Dump([\@imageNodes],['imageNodes']);
}

If I run `pdfinfo.pl`, it reports:

$ pdfinfo.pl test.pdf
File:         test.pdf
File Size:    4599 bytes
Pages:        1
Author:       þÿadmin01
CreationDate: Fri Jan  3 03:48:53 2014
Creator:      þÿPDFCreator Version 1.7.2
Keywords:
ModDate:      Fri Jan  3 03:48:53 2014
Producer:     GPL Ghostscript 9.10
Subject:
Title:        þÿVision6Card
Page Size:    variable
Optimized:    no
PDF version:  1.4
Security
  Passwd:     none
  Print:      yes
  Modify:     yes
  Copy:       yes
  Add:        yes

The test.pdf file can be downloaded from here: http://imaptools.com:8080/dl/test.pdf

There is 1 answer

user2846289 (best answer)

Some parts of CAM::PDF are unfinished. If you look at the source of listimages.pl, you'll see that its content parsing for inline images is rather primitive: for example, it doesn't allow unmatched parentheses between BI and EI (which is what this file has), so it doesn't find the image here. There is also uninlinepdfimages.pl, which uses a different heuristic to find inline images, but on this file it seems to hang, and I don't intend to look into what confuses it.

CAM::PDF::Renderer::Images, as used in your code, is another take on the same problem, and it does parse the content stream properly, but the library seems to provide no means of extracting the image data from there. Still, if you need it very much, I see no technical obstacle (other than your time) to extracting the image programmatically, given the information in @imageNodes (width, height, depth, compression used, image data).
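
As a rough illustration of that last step, here is a minimal sketch that writes raw sample data to a PGM/PPM file. It does not use CAM::PDF at all: the hash keys (width, height, components, data) and the function names raw_to_pnm and write_pnm are made up for this example, and mapping the fields you see in the @imageNodes dump onto them is left to you. It assumes the data is already uncompressed, 8 bits per component, gray or RGB. Flate-compressed data would have to be inflated first, and DCTDecode data is usually a complete JPEG stream that can be written to a .jpg file unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a binary PNM image (PGM for 1 component, PPM for 3) from raw samples.
# The keys width/height/components/data are hypothetical names chosen for
# this sketch; they are NOT the layout of a CAM::PDF image node.
sub raw_to_pnm {
    my (%img) = @_;
    my ($w, $h, $n) = @img{qw(width height components)};

    die "only 1 (gray) or 3 (RGB) components are supported\n"
        unless $n == 1 || $n == 3;

    # With 8 bits per component, every sample is exactly one byte.
    my $expected = $w * $h * $n;
    my $got      = length $img{data};
    die "expected $expected bytes of sample data, got $got\n"
        unless $got == $expected;

    my $magic = $n == 1 ? 'P5' : 'P6';
    return "$magic\n$w $h\n255\n" . $img{data};
}

# Write the PNM bytes to a file in binary mode.
sub write_pnm {
    my ($path, %img) = @_;
    open my $fh, '>:raw', $path or die "Cannot write $path: $!\n";
    print {$fh} raw_to_pnm(%img);
    close $fh or die "Cannot close $path: $!\n";
    return;
}

# Usage with made-up data: a 2x2 RGB image (red, green / blue, white).
my $pixels = pack 'C*', 255, 0, 0,  0, 255, 0,  0, 0, 255,  255, 255, 255;
write_pnm('image.ppm',
    width => 2, height => 2, components => 3, data => $pixels);
```

PNM is used here only because the header is trivial to write by hand; most image tools can convert the result to PNG or another format.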