I have thousands of searchable PDFs, some of which are up to 1 GB with over 2,000 pages. I need to be able to search for a text string in these files using a Node.js app.
Right now, files are stored in a Google Cloud Storage bucket.
What's the best way to do this?
Some options:
- Read the text from the PDF files into MySQL using something like the NPM package pdf-text-extract. Then use MySQL queries to search for text strings.
- Search the PDF files directly using some NPM package.
Am I completely off? Is there a better way?
There are dedicated text search libraries out there, like this one, or this. Most likely you'd need to extract the plain text from each PDF once, then save and index it. After that you can run search queries against the index instead of re-reading the PDFs. Setting up a database for this particular task may be overkill.
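To make the "extract, index, then query" idea concrete, here is a minimal sketch of an in-memory inverted index in plain Node.js. It assumes the text of each page has already been extracted by some other tool. The names `tokenize`, `buildIndex`, and `search`, and the `{ file, page, text }` record shape, are made up for illustration and do not come from any particular library. A real search library would add persistence, ranking, and stemming on top of the same idea.

```javascript
// Split text into lowercase word tokens (letters and digits, Unicode-aware).
function tokenize(text) {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
}

// Build an inverted index: token -> Set of "file#page" keys.
// `pages` is a hypothetical array of { file, page, text } records
// produced by whatever PDF text extraction step you use.
function buildIndex(pages) {
  const index = new Map();
  const texts = new Map(); // lowercased page text, kept for phrase checks
  for (const { file, page, text } of pages) {
    const key = `${file}#${page}`;
    texts.set(key, text.toLowerCase());
    for (const token of new Set(tokenize(text))) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(key);
    }
  }
  return { index, texts };
}

// Find pages that contain the query string.
// Step 1: intersect the posting sets of all query tokens (cheap filter).
// Step 2: confirm the exact string on the remaining candidates.
function search({ index, texts }, query) {
  const tokens = tokenize(query);
  if (tokens.length === 0) return [];
  const postings = tokens.map((t) => index.get(t) || new Set());
  postings.sort((a, b) => a.size - b.size); // start with the rarest token
  let candidates = [...postings[0]];
  for (const p of postings.slice(1)) {
    candidates = candidates.filter((key) => p.has(key));
  }
  const needle = query.toLowerCase();
  return candidates.filter((key) => texts.get(key).includes(needle)).sort();
}
```

Usage would look like `search(buildIndex(pages), 'some phrase')`, which returns a sorted list of `file#page` keys. Note that the final check is a plain substring match, so differences in whitespace or line breaks between the query and the extracted text will cause misses. For a corpus this large you would likely keep the index on disk rather than in memory.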