Problem statement: I am required to get the data from web and put it into SAS dataset by using SAS Program.
Worked well: I am able to fetch the contents of the target web page by SAS.
Not working (Need Help): I am not able to process the source content of the page (shown below) by SAS. I need to find "Category" in the source content and if found then get all the values for that row (NOV,OCT,SEP,AUG,JUL). Same way, I need to find "Conference Board" in the source content and if found then get all the values for that row (96.1,101.4,101.3,86.3,91.7). The source content structure is expected to remain same always. Expected Output is attached as png image. It would be a great learning and help if someone knows and help me how to tackle this scenario by using SAS program.
I tried something like this:
filename output "/desktop/Response.txt";
proc http
url="http://hosting.briefing.com/cschwab/Calendars/EconomicReleases/conf.htm"
method="get"
proxyhost="&proxy_host."
proxyport=&port
out=output;
run;
DATA CHECK;
LENGTH CATEGORY $ 5;
RETAIN CATEGORY;
INFILE output LENGTH = recLen LRECL = 32767;
INPUT line $VARYING32767. recLen;
IF FIND(line,'Category') GT 0 THEN DO;
CATEGORY = SCAN(STRIP(line),2,'<>');
OUTPUT;
END;
RUN;
Source content of the web page:
<table width="100%" cellpadding="2" cellspacing="0" border="0">
<tr valign="top" align="right" class="sectionColor">
<td class="rH" align="left">Category</td>
<td class="rH">NOV</td>
<td class="rH">OCT</td>
<td class="rH">SEP</td>
<td class="rH">AUG</td>
<td class="rH">JUL</td>
</tr>
<tr valign="top" align="right" class="sectionColor">
<td class="rD" align="left">Conference Board</td>
<td class="rD">96.1</td>
<td class="rD">101.4</td>
<td class="rD">101.3</td>
<td class="rD">86.3</td>
<td class="rD">91.7</td>
</tr>
<tr valign="top" align="right" class="sectionColor">
<td class="rL" align="left"> Expectations</td>
<td class="rL">89.5</td>
<td class="rL">98.2</td>
<td class="rL">102.9</td>
<td class="rL">86.6</td>
<td class="rL">88.9</td>
</tr>
<tr valign="top" align="right" class="sectionColor">
<td class="rD" align="left"> Present Situation</td>
<td class="rD">105.9</td>
<td class="rD">106.2</td>
<td class="rD">98.9</td>
<td class="rD">85.8</td>
<td class="rD">95.9</td>
</tr>
<tr valign="top" align="right" class="sectionColor">
<td class="rL" align="left">Employment ('plentiful' less 'hard to get')</td>
<td class="rL">7.2</td>
<td class="rL">7.1</td>
<td class="rL">3.3</td>
<td class="rL">-2.2</td>
<td class="rL">2.2</td>
</tr>
<tr valign="top" align="right" class="sectionColor">
<td class="rD" align="left">1 yr inflation expectations</td>
<td class="rD">5.7%</td>
<td class="rD">5.6%</td>
<td class="rD">5.7%</td>
<td class="rD">5.8%</td>
<td class="rD">6.1%</td>
</tr>
</table>
Output of SAS dataset should be like:
For most scraping you want to use a parsing library or table ripping tool.
However, simple text processing can be used when the html can be relied upon as highly regular or patterned.
Example:
You would have to perform more coding to
htmldecode()
certain columns