Extract text from PDF documents and generate structured data

2.2k views Asked by At

I am able to extract the text from all pages in pdf successfully. But unable to generate in structured data. Guide me if anyone come across such expertise.

Code:

package pdfboxreadfromfile;

import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public class PDFBoxReadFromFile {
  public static void main(String[] args) {
    try {
      File file = new File("C:/ma.pdf");
      PDDocument doc = PDDocument.load(file);
      PDFTextStripper pdfTextStripper = new PDFTextStripper();
      pdfTextStripper.setSortByPosition(true);
      pdfTextStripper.setStartPage(1);
      pdfTextStripper.setEndPage(6);
      String text = pdfTextStripper.getText(doc);
      System.out.println(text);
      doc.close();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

Output:

enter image description here

PDF looks like this. Page 1: enter image description here

Expected header text is only for reference and need notin print. enter image description here

Tried the following:

Pattern p = Pattern.compile("PO...........*?");
Pattern p1 = Pattern.compile("Vendor...........");
Pattern p2 = Pattern.compile("100.....*?");
Pattern p4 = Pattern.compile("Date...............................................*?");
Pattern p5 = Pattern.compile("62...........3*?");
Pattern p6 = Pattern.compile("62710149950...*?");
Pattern p7 = Pattern.compile("627101499504..*?");

Matcher m = p.matcher(text);
Matcher m1 = p1.matcher(text);
Matcher m2 = p2.matcher(text);
Matcher m4 = p4.matcher(text);
Matcher m5 = p5.matcher(text);
Matcher m6 = p6.matcher(text);
Matcher m7 = p7.matcher(text);
m.find();
m1.find();
m2.find();
m4.find();
m5.find();
m6.find();
m7.find();

System.out.println(m.group(0) + "|" + m1.group(0) + "|" + m2.group(0) + "|" + m2.group(0) + "|" + "MAC" + "|" + m4.group(0) + "|" + m5.group(0) + "|");
System.out.println(m.group(0) + "|" + m1.group(0) + "|" + m2.group(0) + "|" + m2.group(0) + "|" + "MAC" + "|" + m4.group(0) + "|" + m6.group(0) + "|");
System.out.println(m.group(0) + "|" + m1.group(0) + "|" + m2.group(0) + "|" + m2.group(0) + "|" + "MAC" + "|" + m4.group(0) + "|" + m7.group(0) + "|");

Structured Output. But issue is Quantity against the Barcode alias Product code is not coming. enter image description here

1

There are 1 answers

0
majster On

You should search the text for the header line (Barcode, Item number, ...) and then parse each following line by splitting it to columns. The columns are separated by spaces, so you can use the String.split() function.