How to scrape data from a web page using SAS

1.3k views Asked by At

Problem statement: I am required to get the data from web and put it into SAS dataset by using SAS Program.

Worked well: I am able to fetch the contents of the target web page by SAS.

Not working (Need Help): I am not able to process the source content of the page (shown below) by SAS. I need to find "Category" in the source content and if found then get all the values for that row (NOV,OCT,SEP,AUG,JUL). Same way, I need to find "Conference Board" in the source content and if found then get all the values for that row (96.1,101.4,101.3,86.3,91.7). The source content structure is expected to remain same always. Expected Output is attached as png image. It would be a great learning and help if someone knows and help me how to tackle this scenario by using SAS program.

I tried something like this:

filename output "/desktop/Response.txt";

proc http           
url="http://hosting.briefing.com/cschwab/Calendars/EconomicReleases/conf.htm"       
method="get"        
proxyhost="&proxy_host."        
proxyport=&port         
out=output;     
run;

DATA CHECK;
LENGTH CATEGORY $ 5;
RETAIN CATEGORY;
INFILE output LENGTH = recLen LRECL = 32767;
INPUT line $VARYING32767. recLen;
IF FIND(line,'Category') GT 0 THEN DO;
CATEGORY = SCAN(STRIP(line),2,'<>');
OUTPUT;
END;
RUN;

Source content of the web page:

<table width="100%" cellpadding="2" cellspacing="0" border="0">
  <tr valign="top" align="right" class="sectionColor">
    <td class="rH" align="left">Category</td>
    <td class="rH">NOV</td>
    <td class="rH">OCT</td>
    <td class="rH">SEP</td>
    <td class="rH">AUG</td>
    <td class="rH">JUL</td>
  </tr>
  <tr valign="top" align="right" class="sectionColor">
    <td class="rD" align="left">Conference Board</td>
    <td class="rD">96.1</td>
    <td class="rD">101.4</td>
    <td class="rD">101.3</td>
    <td class="rD">86.3</td>
    <td class="rD">91.7</td>
  </tr>
 <tr valign="top" align="right" class="sectionColor">
    <td class="rL" align="left">&nbsp;&nbsp;Expectations</td>
    <td class="rL">89.5</td>
    <td class="rL">98.2</td>
    <td class="rL">102.9</td>
    <td class="rL">86.6</td>
    <td class="rL">88.9</td>
  </tr>
  <tr valign="top" align="right" class="sectionColor">
    <td class="rD" align="left">&nbsp;&nbsp;Present Situation</td>
    <td class="rD">105.9</td>
    <td class="rD">106.2</td>
    <td class="rD">98.9</td>
    <td class="rD">85.8</td>
    <td class="rD">95.9</td>
  </tr>
  <tr valign="top" align="right" class="sectionColor">
    <td class="rL" align="left">Employment  ('plentiful' less 'hard to get')</td>
    <td class="rL">7.2</td>
    <td class="rL">7.1</td>
    <td class="rL">3.3</td>
    <td class="rL">-2.2</td>
    <td class="rL">2.2</td>
  </tr>
  <tr valign="top" align="right" class="sectionColor">
    <td class="rD" align="left">1 yr inflation expectations</td>
    <td class="rD">5.7%</td>
    <td class="rD">5.6%</td>
    <td class="rD">5.7%</td>
    <td class="rD">5.8%</td>
    <td class="rD">6.1%</td>
  </tr>
</table>

Output of SAS dataset should be like:

3

There are 3 answers

1
Richard On

For most scraping you want to use a parsing library or table ripping tool.

However, simple text processing can be used when the html can be relied upon as highly regular or patterned.

Example:

filename output temp;

proc http           
url="http://hosting.briefing.com/cschwab/Calendars/EconomicReleases/conf.htm"       
method="get"        
/*proxyhost="&proxy_host."        */
/*proxyport=&port         */
out=output;     
run;

data have_cells (keep=row_num name value);
  length name value $32;

  infile output _infile_=line;
  input;

  retain landmark_found 0 table_found 0 naming 1 in_row 0 row_num -1;

  if not landmark_found then  
    landmark_found = prxmatch('/Highlights/',  line);

  if not landmark_found then
    delete;

  if not table_found then
    table_found = prxmatch('/<table /', line);

  if not table_found then
    delete;

  array names(20) $8 _temporary_;

  if not in_row then
    if prxmatch('/<tr /', line) then do;
      col_index = 0;
      in_row = 1;
      row_num + 1;
      return;
    end;

  if not in_row then
    delete;
td:
  rxtd = prxparse('/<td .*?>(.*)<\/td>/');
  if prxmatch(rxtd, line) then do;
    col_index + 1;
    if naming then do;
      names(col_index) = prxposn(rxtd,1,line);
    end;
    else do;
      name = names(col_index);
      value = prxposn(rxtd,1,line);
      OUTPUT;
    end;
    return;
  end;

  in_row = not prxmatch('/<\/tr/', line);

  if naming then if not in_row then naming = 0;

  if prxmatch('/<\/table>/', line) then stop;
run;

proc transpose data=have_cells out=have_raw;
  by row_num;
  id name;
  var value;
run;

enter image description here

You would have to perform more coding to

  • htmldecode() certain columns
  • right size character columns
  • convert others
    • character to numeric
    • character to date
1
Tom On

IF (and that is a BIG if with HTML pages) the file stays as neat as that example you can parse it pretty easily. Just look for the <table>, <tr> and <td> tags. Make your own counters for table number, row number, column number.

filename output url 
  "http://hosting.briefing.com/cschwab/Calendars/EconomicReleases/conf.htm"
;

data tall;
 length table row col 8 value $200;
 infile output truncover ;
 do until(left(_infile_)=:'<table'); input ; end;
 table+1;
 row=0;
 do until(left(_infile_)=:'</table');
   input;
   if left(_infile_)=:'<tr' then do; row+1; col=0; end;
   if left(_infile_)=:'<td' then do; 
      value = left(scan(_infile_,-2,'<>')); 
      col+1; 
      if value ne ' ' then output; 
   end;
 end;
run;

Then use PROC TRANSPOSE to make a rectangular structure. Looks like you probably don't want the first table.

proc transpose data=tall out=tables(drop=_name_) prefix=col;
  where table=2;
  by table row;
  id col;
  var value;
run;

enter image description here

But you probably want to transpose that. So re-sort before the PROC TRANSPOSE step.

proc sort data=tall;
  by table col row;
run;
proc transpose data=tall out=tables(drop=_name_) prefix=col;
  where table=2;
  by table col;
  id row;
  var value;
run;

Results enter image description here

0
Joe On

While SAS doesn't have the absolute best HMTL parsing options available unfortunately, you do have a few options that can work short of parsing it from scratch.

First, you can treat well-formed HTML like XML, sometimes. Yours is well formed enough to nearly do that; the only problem is the   and the lack of tag at the start, which is easily remedied.

The best way to do this is usually to copy the text out to a file, and then open the SAS XML Mapper utility. That will let you get the log messages that let you know what's wrong with it so you can fix it.

Once you've fixed whatever minor issues there were, you can simply use:

filename xml_file '<your html-as-xml file path>';

filename map temp;
  
libname xml_file xmlv2 automap=replace xmlmap=map;

That map could then be used to further parse the document. This gives you a libname "xml_file" which then has members "table", "tr", and "td", which you can relate to each other using ordinals (basically, auto-incrementing index values).

Your second option, which I would recommend for most cases, is to farm out the parsing to some other language. Python has a great parser, BeautifulSoup, which in SAS 9.4 TS1M6 could be used from inside SAS via the FCMP engine; or, if you're familiar with Groovy, PROC GROOVY also is pretty good at parsing HTML. You could either write a separate program in the other language, or integrate them; SAS integrates well with several languages (Python and R, for starters, both of which have good parsers available).