Is there a way to extract text from PDFs in nodejs without any OS dependencies (like pdf2text, or xpdf on windows)? I wasn't able to find any 'native' pdf packages in nodejs. They always are a wrapper/util on top of an existing OS command. Thanks
PDF to Text extractor in nodejs without OS dependencies
27.3k views Asked by bartium AtThere are 5 answers
Instead of using the proposed PDF2Json you can also use PDF.js directly (https://github.com/mozilla/pdfjs-dist). This has the advantage that you are not depending on modesty who owns PDF2Json and that he updates the PDF.js base.
Thought I'd chime in here for anyone who came across this question in the future. I had this problem and spent hours over literally all the PDF libraries on NPM. My requirements were that I needed to run it on AWS Lambda so could not depend on OS dependencies.
The code below is adapted from another stackoverflow answer (which I cannot currently find). The only difference being that we import the ES5 version which works with Node >= 12. If you just import pdfjs-dist there will be an error of "Readable Stream is not defined". Hope it helps!
import * as pdfjslib from 'pdfjs-dist/es5/build/pdf.js';
export default class Pdf {
public static async getPageText(pdf: any, pageNo: number) {
const page = await pdf.getPage(pageNo);
const tokenizedText = await page.getTextContent();
const pageText = tokenizedText.items.map((token: any) => token.str).join('');
return pageText;
}
public static async getPDFText(source: any): Promise<string> {
const pdf = await pdfjslib.getDocument(source).promise;
const maxPages = pdf.numPages;
const pageTextPromises = [];
for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
pageTextPromises.push(Pdf.getPageText(pdf, pageNo));
}
const pageTexts = await Promise.all(pageTextPromises);
return pageTexts.join(' ');
}
}
Usage
const fileBuffer = fs.readFile('sample.pdf');
const pdfText = await Pdf.getPDFText(fileBuffer);
After some work, I finally got a reliable function for reading text from PDF using https://github.com/mozilla/pdfjs-dist
To get this to work, first npm install on the command line:
npm i pdfjs-dist
Then create a file with this code (I named the file "pdfExport.js" in this example):
const pdfjsLib = require("pdfjs-dist");
async function GetTextFromPDF(path) {
let doc = await pdfjsLib.getDocument(path).promise;
let page1 = await doc.getPage(1);
let content = await page1.getTextContent();
let strings = content.items.map(function(item) {
return item.str;
});
return strings;
}
module.exports = { GetTextFromPDF }
Then it can simply be used in any other js file you have like so:
const pdfExport = require('./pdfExport');
pdfExport.GetTextFromPDF('./sample.pdf').then(data => console.log(data));
This solution worked for me using node 14.20.1 using "pdf-parse": "^1.1.1"
You can install it with:
yarn add pdf-parse
This is the main function which converts the PDF file to text.
const path = require('path');
const fs = require('fs');
const pdf = require('pdf-parse');
const assert = require('assert');
const extractText = async (pathStr) => {
assert (fs.existsSync(pathStr), `Path does not exist ${pathStr}`)
const pdfFile = path.resolve(pathStr)
const dataBuffer = fs.readFileSync(pdfFile);
const data = await pdf(dataBuffer)
return data.text
}
module.exports = {
extractText
}
Then you can use the function like this:
const { extractText } = require('../api/lighthouse/lib/pdfExtraction')
extractText('./data/CoreDeveloper-v5.1.4.pdf').then(t => console.log(t))
Have you checked PDF2Json? It is built on top of PDF.js. Though it is not providing the text output as a single line but I believe you may just reconstruct the final text based on the generated Json output: