Problem Statement: I am unable to read data from a PDF file using SAS.
What worked well: I am able to download the PDF from the website and save it.
Not working (need help): I cannot read the data from the downloaded PDF file using SAS. The structure of the source content is expected to stay the same every time. The expected output is attached as a JPG image.
It would be a great help, and a good learning opportunity, if someone could show me how to tackle this scenario with a SAS program.
I tried something like this:
/*Proxy address*/
%let proxy_host=xxx.com;
%let port=123;
/*Output location*/
filename output "/desktop/Response.pdf";
/*Download the source file and save it in the desired location*/
proc http
url="https://cdn.nar.realtor/sites/default/files/documents/ehs-10-2020-overview-2020-11-19_0.pdf"
method="get"
proxyhost="&proxy_host."
proxyport=&port
out=output;
run;
%let lineSize = 2000;
data base;
format text_line $&lineSize..;
infile output lrecl=&lineSize;
input text_line $;
run;
DATA _NULL_ ;
X "PS2ASCII /desktop/Response.pdf /desktop/flatfile.txt";
RUN;
You can use the Apache PDFBox® library, an open-source Java tool for working with PDF documents. The library can be called from within SAS through Proc GROOVY, with Java code that strips the text and its position on the page from a PDF document.
You will then have to write more code to build a data set from the stripped text.
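As a rough illustration of that last step, here is a minimal sketch in plain Java (standard library only) of one way to turn positioned text fragments into delimited rows. It does not call PDFBox itself. The class PdfTextRows, the Fragment type and the sample values are all hypothetical, and it assumes each fragment arrives with an x and y coordinate where y grows down the page and fragments on the same line have y values that round to the same number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of PDFBox or SAS): groups positioned
// text fragments into pipe-delimited rows.
class PdfTextRows {

    // One piece of text plus its page coordinates (illustrative names).
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Fragments whose y values round to the same number are treated as one
    // row. Rows are ordered by ascending y, cells by ascending x.
    static List<String> toRows(List<Fragment> fragments) {
        Map<Long, List<Fragment>> byLine = new TreeMap<>();
        for (Fragment f : fragments) {
            byLine.computeIfAbsent(Math.round(f.y), k -> new ArrayList<>()).add(f);
        }
        List<String> rows = new ArrayList<>();
        for (List<Fragment> line : byLine.values()) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder row = new StringBuilder();
            for (Fragment f : line) {
                if (row.length() > 0) {
                    row.append('|');
                }
                row.append(f.text.trim());
            }
            rows.add(row.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up fragments standing in for whatever the text stripper reports.
        List<Fragment> sample = Arrays.asList(
            new Fragment(200.0, 100.2, "B1"),
            new Fragment(50.0, 100.0, "A1"),
            new Fragment(50.0, 120.0, "A2"),
            new Fragment(200.0, 119.8, "B2"));
        for (String row : toRows(sample)) {
            System.out.println(row);
        }
    }
}
```

If rows like these are written to a text file, a DATA step can read them with an INFILE statement that uses dlm='|'. The rounding tolerance is a guess and would need tuning against the real document.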