I have over 1000 PDF files and need to extract the text from each of them into a .txt file. I got the code working for a single PDF file, but not for multiple PDFs. My code is below:
Main:

    package pdftest;

    import java.io.File;
    import java.io.IOException;

    public class JavaPDFTest {
        public static void main(String[] args) throws IOException {
            String path = "C:\\Users\\arunk01\\Desktop\\Java_Extraction\\";
            String files;
            File folder = new File(path);
            File[] listOfFiles = folder.listFiles();
            for (int i = 0; i < listOfFiles.length; i++) {
                if (listOfFiles[i].isFile()) {
                    files = listOfFiles[i].getName();
                    if (files.endsWith(".pdf") || files.endsWith(".PDF")) {
                        System.out.println(files);
                        String nfiles = "C:\\Users\\arunk01\\Desktop\\Java_Extraction\\";
                        PDFManager pdfManager = new PDFManager();
                        String pdfToText = pdfManager.pdftoText(nfiles + files);
                        if (pdfToText == null) {
                            System.out.println("PDF to Text Conversion failed.");
                        } else {
                            System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
                            pdfManager.writeTexttoFile(pdfToText, nfiles + files + ".txt");
                        }
                    }
                }
            }
        }
    }
Class:

    package pdftest;

    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.io.RandomAccessFile;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PDFManager {
        private PDFParser parser;
        private PDFTextStripper pdfStripper;
        private PDDocument pdDoc;
        private COSDocument cosDoc;
        private String pdftoText;
        private String Text;
        private String filePath;
        private File file;

        public PDFManager() {
        }

        public String ToText() throws IOException {
            this.pdfStripper = null;
            this.pdDoc = null;
            this.cosDoc = null;
            file = new File(filePath);
            parser = new PDFParser(new RandomAccessFile(file, "r")); // update for PDFBox V 2.0
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            pdDoc.getNumberOfPages();
            pdfStripper.setStartPage(1);
            // pdfStripper.setEndPage(10);
            // reading text from page 1 to 10
            // if you want to get text from full pdf file use this code
            pdfStripper.setEndPage(pdDoc.getNumberOfPages());
            Text = pdfStripper.getText(pdDoc);
            return Text;
        }

        public void setFilePath(String filePath) {
            this.filePath = filePath;
        }

        public String pdftoText(String string) {
            // TODO Auto-generated method stub
            return Text;
        }

        public void writeTexttoFile(String pdfToText2, String string) {
            // TODO Auto-generated method stub
        }
    }
I am not getting any error, but for every file it prints "PDF to Text Conversion failed." (it hits the null check in Main):
2016__00002685__00.PDF
PDF to Text Conversion failed.
2016__00002685__01.PDF
PDF to Text Conversion failed.
2016__100018__00.PDF
PDF to Text Conversion failed.
2016__100018__01.PDF
PDF to Text Conversion failed.
Can someone help me with the code to convert multiple PDFs to text?
Thanks, Arun
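The pdftoText method in the PDFManager class returns the Text field, which is still null because nothing ever sets the file path or calls ToText. The writeTexttoFile method is also an empty stub, so nothing would be written even if the text were extracted. You need to invoke the ToText method from pdftoText. Try this: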
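A minimal sketch of that fix, assuming PDFBox 2.x is on the classpath (the question's imports suggest 2.0). It keeps the method names from the question so Main does not need to change. Using PDDocument.load instead of the hand-built PDFParser, writing the output as UTF-8, and logging failures to System.err are choices made for this sketch, not part of the original code. The package line is omitted here; add `package pdftest;` (and `public`, if you want) when you drop it into your project.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Sketch of a corrected PDFManager (assumes PDFBox 2.x).
class PDFManager {
    private String filePath;

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }

    // Extracts the text of all pages of the PDF at filePath.
    public String ToText() throws IOException {
        // try-with-resources closes the document even if extraction fails,
        // which matters when looping over 1000+ files
        try (PDDocument pdDoc = PDDocument.load(new File(filePath))) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(pdDoc.getNumberOfPages());
            return pdfStripper.getText(pdDoc);
        }
    }

    // Sets the path, runs the extraction, and returns null on failure
    // so the existing null check in main keeps working.
    public String pdftoText(String path) {
        setFilePath(path);
        try {
            return ToText();
        } catch (IOException e) {
            System.err.println("Could not read " + path + ": " + e.getMessage());
            return null;
        }
    }

    // Writes the extracted text to outputPath as UTF-8 (encoding is an assumption).
    public void writeTexttoFile(String text, String outputPath) {
        try {
            Files.write(Paths.get(outputPath), text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            System.err.println("Could not write " + outputPath + ": " + e.getMessage());
        }
    }
}
```

With this in place, the loop in Main calls pdftoText once per PDF and writes each result next to its source as `<name>.PDF.txt`. A file that cannot be parsed is reported and skipped rather than stopping the whole run.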