How to read data from a PDF using a SAS program


Problem Statement: I am unable to read data from a PDF file using SAS.

What worked well: I am able to download the PDF from the website and save it.

Not working (need help): I cannot read the data from the PDF file using SAS. The structure of the source content is expected to stay the same. The expected output is attached as a JPG image.

I would appreciate it if someone could show me how to handle this scenario with a SAS program. The image below shows the source in PDF format; the expected result is the same content as a SAS data set:

I tried something like this:

/*Proxy address*/
%let proxy_host=xxx.com;
%let port=123;

/*Output location*/
filename output "/desktop/Response.pdf";

/*Download the source file and save it in the desired location*/
proc http           
url="https://cdn.nar.realtor/sites/default/files/documents/ehs-10-2020-overview-2020-11-19_0.pdf"       
method="get"        
proxyhost="&proxy_host."        
proxyport=&port         
out=output;     
run;

%let lineSize = 2000;

data base;
   format text_line $&lineSize..;
   infile output lrecl=&lineSize;
   input text_line $;
run;

DATA _NULL_ ;
X "PS2ASCII /desktop/Response.pdf /desktop/flatfile.txt";
RUN;

There is 1 answer

Richard

You can use the Apache PDFBox® library, an open source Java tool for working with PDF documents. The library can be called from within SAS PROC GROOVY, using Java code that strips the text, and its position on the page, from a PDF document.

Example:

You will have to write more code to make a data set from the stripped text.

filename overview "overview.pdf";
filename ov_text  "overview.txt";

* download a pdf document;

proc http           
url="https://cdn.nar.realtor/sites/default/files/documents/ehs-10-2020-overview-2020-11-19_0.pdf"       
method="get"        
/*proxyhost="&proxy_host."        */
/*proxyport=&port         */
out=overview;     
run;

* download the Apache PDFBox library (a .jar file); 

filename jar 'pdfbox.jar';

%if %sysfunc(FEXIST(jar)) ne 1 %then %do;
  proc http
    url='https://www.apache.org/dyn/closer.lua?filename=pdfbox/2.0.21/pdfbox-app-2.0.21.jar&action=download'
    out=jar;
  run;
%end;

* Use GROOVY to read the PDF, strip out the text and its position, and write
* the result to a text file which SAS can read;

proc groovy classpath="pdfbox.jar"; 
  submit 
    "%sysfunc(pathname(overview))"  /* the input, a pdf file */
    "%sysfunc(pathname(ov_text))"   /* the output, a text file */
  ;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.io.FileWriter;
import java.io.PrintWriter;

public class GetLinesFromPDF extends PDFTextStripper {
    
    static List<String> lines = new ArrayList<String>();
    public GetLinesFromPDF() throws IOException {
    }
    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException {
        PDDocument document = null;
        PrintWriter out = null;
        String inPdf = args[0];
        String outTxt = args[1];

        try {
            document = PDDocument.load( new File(inPdf) );

            PDFTextStripper stripper = new GetLinesFromPDF();

            stripper.setSortByPosition( true );
            stripper.setStartPage( 1 );  // PDFBox page numbers are 1-based
            stripper.setEndPage( document.getNumberOfPages() );

            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);
            
            out = new PrintWriter(new FileWriter(outTxt));

            // print lines to text file
            for(String line:lines){
              out.println(line); 
            }
        }
        finally {
            if( document != null ) {
                document.close();
            }
            if( out != null ) {
                out.close();
            }
        }
    }
    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        String places = "";

        for(TextPosition tp:textPositions){
          places += "(" + tp.getX() + "," + tp.getY() + ") ";
        }

        lines.add(str + " found @ " + places);
    }
}

  endsubmit;
quit;

* preview the stripped text that was saved;

data _null_;
  infile ov_text;
  input;
  putlog _infile_;
run;

/*
 * additional SAS code will be needed to input the text as data 
 * and construct a data set that matches the original tabular content layout
 */
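
As a starting point for that additional code, here is a minimal sketch of a DATA step that splits each stripped line back into its text and the coordinates of its first character. It assumes the text file has exactly the layout written by the Groovy step above, that is, `text found @ (x,y) (x,y) ...`. The data set name `ov_lines`, the variable names, and the `$500` text length are hypothetical choices for illustration, not part of the original answer.

```sas
/* Sketch only: one row per "text found @ (x,y) (x,y) ..." line.        */
/* Assumes overview.txt was written by the PROC GROOVY step above.      */
filename ov_text "overview.txt";

data ov_lines;                            /* hypothetical data set name */
  infile ov_text lrecl=32767 truncover;
  length line $32767 text $500 first_pair $64;
  input line $char32767.;

  marker = index(line, ' found @ ');      /* start of the 9-character separator */
  if marker < 2 then delete;              /* no separator, or no text before it */

  text = substr(line, 1, marker - 1);

  /* the first coordinate pair belongs to the first character of the text */
  first_pair = scan(substr(line, marker + 9), 1, ' ');
  x = input(scan(first_pair, 1, '(,)'), 32.);
  y = input(scan(first_pair, 2, '(,)'), 32.);

  keep text x y;
run;

/* look at the first few parsed rows */
proc print data=ov_lines(obs=20);
run;
```

The rows stay in the order the stripper wrote them. Grouping rows that share a `y` value into table rows, and assigning `x` ranges to columns, still depends on the layout of the specific PDF and is left out here.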