I've tried using Adobe Acrobat X Pro to "recognize text in multiple files."

When I started the process and it asked for a directory, I chose C:, my main hard drive.

It took hours to load, and when it did, the list of files it generated included Word documents as well. Adobe said I couldn't proceed until I removed the problem files.

Once I had removed all the PDFs Adobe flagged as having errors (like password protection) and the prompt remained, I assumed it meant the Word documents in the list.

So I manually removed those too. But Adobe still said I couldn't proceed until the problem files were removed, even though there were no files left in the list that it had flagged as having issues.

My firm is trying to make sure all the PDFs we have are searchable. Currently, some are and some aren't. Our goal is to make them all searchable without moving them from their various locations.


1 Answer

Joris Schellekens (Best Answer)

I think you can do this using a combination of:

  • plain Java: to list all the files in a directory that match a given criterion (e.g. their name ends with '.pdf')
  • iText: to iterate over each PDF document and extract all its images
  • Tess4J: a Java port of Tesseract (Google's OCR engine), to turn the extracted images back into text
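
For illustration, here is a minimal sketch of how those three pieces could fit together, assuming iText 5's parser API (PdfReaderContentParser / RenderListener) and Tess4J on the classpath; the root directory and the tessdata path are placeholders you would adjust.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class OcrAllPdfs {

    public static void main(String[] args) throws IOException {
        // 1. Plain Java: recursively list every file under the root whose name ends in '.pdf'.
        Path root = Paths.get("C:/documents");      // hypothetical starting directory
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(p -> p.toString().toLowerCase().endsWith(".pdf"))
                 .forEach(OcrAllPdfs::ocrImagesIn);
        }
    }

    private static void ocrImagesIn(Path pdf) {
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("C:/tessdata");        // location of the Tesseract language data (assumption)

        try {
            // 2. iText: walk the content of every page and hand each image to a render listener.
            PdfReader reader = new PdfReader(pdf.toString());
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                parser.processContent(page, new RenderListener() {
                    @Override
                    public void renderImage(ImageRenderInfo info) {
                        try {
                            BufferedImage image = info.getImage().getBufferedImage();
                            if (image == null) {
                                return;              // some image formats cannot be decoded this way
                            }
                            // 3. Tess4J: run OCR on the extracted image.
                            String text = tesseract.doOCR(image);
                            System.out.println(pdf + " -> " + text);
                        } catch (IOException | TesseractException e) {
                            e.printStackTrace();
                        }
                    }
                    @Override public void beginTextBlock() { }
                    @Override public void endTextBlock() { }
                    @Override public void renderText(TextRenderInfo textInfo) { }
                });
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```

Note that this sketch only extracts the images and prints the recognized text; to make the PDFs searchable in place you would still need to write that text back into each file (for example as an invisible text layer).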

Unless I am much mistaken, Tesseract itself even offers a crude version of this workflow, but only for one PDF at a time, so you'd still need some Windows/Linux scripting to feed it all the files in a given directory.