Java - Text Extraction from PDF using OCR

Question

Java - Text Extraction from PDF using OCR

14.9k views Asked by Dax Amin At 16 April 2016 at 07:58

I have a pdf file (some part of it given below), and want to extract text from it. I have used PDFTextStream, but it doesn't work with this file. (However it worked with other file, that has simple text).

What other OCR libraries are capable of doing it?

Please Help. Thank you.

Original Q&A

There are 1 answers

**Dax Amin** · Answer 1 · 2016-04-16T11:16:28+00:00

I tried with PDFBox and it produced satisfactory results.

Here is the code to extract text from PDF using PDFBox:

import java.io.*;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.util.*;

public class PDFTest {

 public static void main(String[] args){
 PDDocument pd;
 BufferedWriter wr;
 try {
         File input = new File("C:/BillOCR/data/bill.pdf");  // The PDF file from where you would like to extract
         File output = new File("D:/SampleText.txt"); // The text file where you are going to store the extracted data
         pd = PDDocument.load(input);
         System.out.println(pd.getNumberOfPages());
         System.out.println(pd.isEncrypted());
         pd.save("CopyOfBill.pdf"); // Creates a copy called "CopyOfInvoice.pdf"
         PDFTextStripper stripper = new PDFTextStripper();
         stripper.setStartPage(1); //Start extracting from page 3
         stripper.setEndPage(1); //Extract till page 5
         wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
         stripper.writeText(pd, wr);
         if (pd != null) {
             pd.close();
         }
        // I use close() to flush the stream.
        wr.close();
 } catch (Exception e){
         e.printStackTrace();
        }
     }
}

TechQA.

Java - Text Extraction from PDF using OCR

There are 1 answers

Related Questions in JAVA

Related Questions in PDF

Related Questions in PDFBOX

Related Questions in TEXT-EXTRACTION

Related Questions in PDFTEXTSTREAM

Popular Questions

Popular Tags

Trending Questions