Is there any library that can be used with PHP to extract text from a rectangular region of a PDF?


I am looking for a (preferably free) library that can extract text from a rectangular region of a PDF page, where the region is specified by left, top, width and height parameters. It should be usable with PHP on a Linux system. Could you please suggest such a library and a working example?

Answer by Kurt Pfeifle

Commandline

PHP can use external commandline tools as well. So if this is an option for you...

If you use the commandline pdftotext -- but only the Poppler version, not the XPDF version! -- you have these optional CLI parameters:

  -x   : x-coordinate of the crop area top left corner
  -y   : y-coordinate of the crop area top left corner
  -W   : width of crop area in pixels (default is 0)
  -H   : height of crop area in pixels 
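Since these crop options depend on which pdftotext is installed, it may help to check first. The following is only a sketch: the function names are made up for illustration, and it assumes that a Poppler build mentions "Poppler" somewhere in the banner printed by `pdftotext -v`.

```shell
#!/bin/sh
# Reads a version banner on stdin and succeeds if it mentions Poppler.
# Assumption: Poppler builds include the word "Poppler" in that banner.
is_poppler_banner() {
    grep -qi 'poppler'
}

# Hypothetical helper: checks the pdftotext that is found on PATH.
# The banner may be written to stderr, so both streams are merged.
pdftotext_is_poppler() {
    command -v pdftotext >/dev/null 2>&1 || return 1
    pdftotext -v 2>&1 | is_poppler_banner
}

# Usage (after sourcing this file):
#   pdftotext_is_poppler && echo "crop options should be available"
```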

A working example:

First, let's create a PDF from the Bash man page, using Ghostscript:

man -t bash | gs -o man-bash.pdf -sDEVICE=pdfwrite -

Next, let's extract some text from it. Use a width of 200, a height of 100, and the top left corner at (200,200). Coordinates are measured from the top left corner of the page, which is at (0,0):

kp@mbp:~$  pdftotext -f 1 -l 1 -x 200 -y 200 -W 200 -H 100 man-bash.pdf -
 
 a conformant implementation of the Shell and Ut
 andard 1003.1). Bash can be configured to be POS
 
 acter shell options documented in the description
 the shell is invoked. In addition, bash interprets
 
 option is present, then commands are read from s

Note my usage of -f (for first page) and -l (for last page). If you don't use them, pdftotext will print the respective text region for every single page of a multi-page PDF.
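To match the left, top, width and height parameters from the question, the call above could be wrapped in a small shell script. This is a minimal sketch, not a tested tool: `region_args`, `extract_region` and `is_uint` are names invented here, and the wrapper only accepts non-negative integers.

```shell
#!/bin/sh
# Succeeds only if $1 is a non-empty string of digits.
is_uint() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Builds the pdftotext options for one page and one rectangle.
# Arguments: LEFT TOP WIDTH HEIGHT PAGE (origin at the top left).
region_args() {
    for value in "$@"; do
        is_uint "$value" || return 1
    done
    [ "$#" -eq 5 ] || return 1
    printf '%s' "-f $5 -l $5 -x $1 -y $2 -W $3 -H $4"
}

# Hypothetical wrapper: extract_region FILE LEFT TOP WIDTH HEIGHT [PAGE]
# Prints the extracted text to stdout ("-" as the output file).
extract_region() {
    opts=$(region_args "$2" "$3" "$4" "$5" "${6:-1}") || return 1
    # $opts is left unquoted on purpose so it splits into separate options.
    pdftotext $opts "$1" -
}

# Usage (after sourcing this file):
#   extract_region man-bash.pdf 200 200 200 100 1
```

From PHP, a script like this could presumably be run with `shell_exec()`, passing each argument through `escapeshellarg()` first.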

Compare to this screenshot:

Screenshot of PDF with Bash man page, selected rectangle being highlighted

Looks like it worked as expected, no?
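One thing to watch out for: rectangles taken from other PDF tools are often given in PDF user space, with the origin at the bottom left, while the crop area above is measured from the top left. The sketch below converts between the two. It assumes integer coordinates, an unrotated page, and the default resolution of 72 DPI (so that one pixel equals one PDF point); `pdf_rect_to_crop` is a made-up name, and the page height of 792 in the usage line is just the US Letter value, which may differ from your page (pdfinfo reports the actual size).

```shell
#!/bin/sh
# Converts a bottom-left-origin rectangle into pdftotext crop options.
# Arguments: LLX LLY URX URY PAGE_HEIGHT (all integers, in points).
pdf_rect_to_crop() {
    llx=$1 lly=$2 urx=$3 ury=$4 page_h=$5
    # The top edge of the rectangle becomes the distance from the page top.
    echo "-x $llx -y $((page_h - ury)) -W $((urx - llx)) -H $((ury - lly))"
}

# Usage (after sourcing this file), for a 792 point high page:
#   pdf_rect_to_crop 200 492 400 592 792
#   prints: -x 200 -y 200 -W 200 -H 100
```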

Library

Poppler

Poppler can also be used as a library. But I do not have any experience with this approach (nor much with PHP).

TET

If you cannot find a free library which fulfills your requirements, then have a look at the best thing for text extraction from PDFs: TET, the Text Extraction Toolkit. TET is part of the PDFlib.com family of products.

PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is one of the authors of the "PostScript and PDF Bible".

TET's first incarnation is a library. That one can probably do everything you ever wanted, including positional information about every element on the page.

PDFlib.com also offers another incarnation of this technology, the TET plugin for Acrobat. And the third incarnation is the PDFlib TET iFilter. This is a standalone tool for Windows desktops. Both of these are free (as in beer) to use for private, non-commercial purposes.

In my experience, TET is way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) only spat out garbage.

Give it a try.