Bash script to check whether PDFs are OCR'd


I don't really know where to start on this.

I have a Linux server with over 8000 PDFs and need to know which PDFs have been OCR'd and which ones haven't.

I was thinking of some sort of script calling XPDF to check each PDF, but to be honest I'm not sure if this is possible.

Thanks in advance for any help.


There are 2 answers

Kurt Pfeifle

Make sure you have the command-line tool pdffonts installed. (There are two versions of it: one ships as part of xpdf-utils, the other as part of poppler-utils.)

PDFs that consist only of scanned pages use no fonts at all (neither embedded nor unembedded ones).

The command

pdffonts /path/to/scanned.pdf

will then not show any font information for that file.

This may already be enough for you to separate your files into two different sets.

If you have PDFs that mix scanned pages with "normal" pages (or scanned-and-OCR'ed pages), you'll have to extend and refine the simplistic approach above. See man pdffonts or pdffonts --help for more info.
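One way to refine this for mixed documents is to query each page separately. The following is only a sketch: the function name pages_without_fonts is made up for illustration, and it assumes that pdfinfo is installed alongside pdffonts and that your pdffonts accepts the -f and -l page-range flags (both the xpdf and poppler builds document them).

```bash
#!/bin/bash
# Hypothetical helper: print the numbers of the pages that reference no fonts.
# Assumes pdfinfo and pdffonts (xpdf-utils or poppler-utils) are on PATH.
pages_without_fonts() {
    local pdf=$1 pages page fonts
    # pdfinfo prints a line such as "Pages:          12".
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
    # Give up if the page count could not be read.
    [ -n "$pages" ] || return 1
    for ((page = 1; page <= pages; page++)); do
        # -f/-l restrict pdffonts to one page; tail drops the two header lines.
        fonts=$(pdffonts -f "$page" -l "$page" "$pdf" | tail -n +3)
        if [ -z "$fonts" ]; then
            echo "$page"
        fi
    done
}
```

Calling pages_without_fonts /path/to/mixed.pdf lists the pages that are probably image-only. It starts one pdffonts process per page, so it is best reserved for files the quick whole-file check could not settle.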

Nathaniel M. Beaver

The trouble with pdffonts is that sometimes it returns nothing, like this:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------

And sometimes it returns this:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
[none]                               Type 3            yes no  no     266  0
[none]                               Type 3            yes no  no       9  0
[none]                               Type 3            yes no  no     297  0
[none]                               Type 3            yes no  no     341  0
[none]                               Type 3            yes no  no     381  0
[none]                               Type 3            yes no  no     394  0
[none]                               Type 3            yes no  no     428  0
[none]                               Type 3            yes no  no     441  0
[none]                               Type 3            yes no  no     451  0
[none]                               Type 3            yes no  no     480  0
[none]                               Type 3            yes no  no     492  0
[none]                               Type 3            yes no  no     510  0
[none]                               Type 3            yes no  no     524  0
[none]                               Type 3            yes no  no     560  0
[none]                               Type 3            yes no  no     573  0
[none]                               Type 3            yes no  no     584  0
[none]                               Type 3            yes no  no     593  0
[none]                               Type 3            yes no  no     601  0
[none]                               Type 3            yes no  no     644  0

With that in mind, let's write a little pipeline that lists the unique font names in a PDF:

pdffonts my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq

If your PDF is not OCR'ed, this will output either nothing or [none].

If you want it to run faster, use the -l flag to only analyze, say, the first 5 pages:

pdffonts -l 5 my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq

Now wrap it in a bash script, e.g. is-pdf-ocred.sh:

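#!/bin/bash
# Usage: is-pdf-ocred.sh FILE.pdf
# Collect the unique font names used in the first 5 pages.
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
# No fonts at all, or only unnamed ([none]) fonts, means no OCR text layer.
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    echo "NOT OCR'ed: $1"
else
    echo "$1 is OCR'ed."
fi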
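The script treats "no fonts" and "only [none] fonts" alike. If you would rather keep the two cases apart, so that you can inspect the [none] files by hand, a filter along these lines may help. This is a sketch, not part of the script above: the function name classify_pdffonts_output and the three labels are invented here, and it only assumes the two-line header shown in the pdffonts output earlier.

```bash
#!/bin/bash
# Hypothetical filter: read pdffonts output on stdin and print one label:
#   no-fonts      only the two header lines were present
#   unnamed-only  every font row has the name [none]
#   has-fonts     at least one named font was found
classify_pdffonts_output() {
    tail -n +3 | awk '
        NF == 0 { next }              # ignore blank lines
        { rows++ }                    # count every font row
        $1 != "[none]" { named++ }    # count rows with a real font name
        END {
            if (rows == 0)       print "no-fonts"
            else if (named == 0) print "unnamed-only"
            else                 print "has-fonts"
        }'
}
```

Usage would be something like pdffonts -l 5 my-doc.pdf | classify_pdffonts_output.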

Finally, we want to be able to search for PDFs. The find command does not know about the aliases or functions in your .bashrc, so we need to give it the path to the script (which must be executable, e.g. after chmod +x). Run it in your directory of choice like so:

find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \;

I'm assuming that the PDF files end in .pdf, although that is not always a safe assumption. You will probably want to pipe the output to less or redirect it into a text file:

find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; | less
find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; > pdfs.txt
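If some of your files lack the .pdf extension, you could select them by content instead. A well-formed PDF begins with the bytes %PDF-, so a rough check is to look at the first five bytes. The sketch below assumes a head that supports -c (GNU and BSD versions do); the function names is_pdf and find_pdfs are made up for this example, and damaged files with junk before the header will be missed.

```bash
#!/bin/bash
# Hypothetical helper: succeed if the file starts with the PDF signature.
is_pdf() {
    local magic
    # Read the first 5 bytes; tr strips NUL bytes that bash cannot store.
    magic=$(head -c 5 -- "$1" 2>/dev/null | tr -d '\0')
    [ "$magic" = "%PDF-" ]
}

# Hypothetical helper: print every regular file under a directory that
# looks like a PDF, whatever its name.
find_pdfs() {
    local f
    # -print0 and read -d '' keep file names with spaces or newlines intact.
    find "$1" -type f -print0 |
        while IFS= read -r -d '' f; do
            if is_pdf "$f"; then
                printf '%s\n' "$f"
            fi
        done
}
```

You could then feed the result to the checker, e.g. find_pdfs . | while IFS= read -r f; do /path/to/my/script/is-pdf-ocred.sh "$f"; done (this simple loop assumes no newlines in file names).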

I was able to process about 200 PDFs in a little more than 10 seconds using the -l 5 flag.
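With thousands of files, running several checks at once may shorten the wait further. Here is a sketch using xargs; it assumes GNU xargs (the -P and -r options are not in POSIX), and the function name scan_tree_parallel and the default of 4 jobs are arbitrary choices for illustration. I have not timed it against the numbers above.

```bash
#!/bin/bash
# Hypothetical helper: run a checker script on every *.pdf under a directory,
# several files at a time.
#   $1  directory to search
#   $2  path to the checker script (e.g. is-pdf-ocred.sh)
#   $3  number of parallel jobs (optional, defaults to 4)
scan_tree_parallel() {
    local dir=$1 checker=$2 jobs=${3:-4}
    # -print0 / -0 keep odd file names intact; -n 1 passes one file per run;
    # -r skips the run entirely when find matches nothing (GNU xargs).
    find "$dir" -type f -name '*.pdf' -print0 |
        xargs -0 -r -n 1 -P "$jobs" "$checker"
}
```

For example: scan_tree_parallel . /path/to/my/script/is-pdf-ocred.sh 8 > pdfs.txt. The lines arrive in whatever order the jobs finish, so sort the file afterwards if the order matters.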