SAS to R conversion of merge

632 views Asked by At

I am currently working on converting a SAS macro into a R code. I have worked a lot on R but I am relatively new to SAS. I am having trouble understanding the SAS code for a merge command -

data dates;
merge A(keep=date rename=(date=beg))
A(keep=date firstobs= 5 rename=(date=end))
A(keep=date firstobs= 10 rename=(date=bega))
A(keep=date firstobs= 15 rename=(date=ee))
A(keep=date firstobs= 30 rename=(date=eend));
index+1;
if nmiss(beg,end,bega,eend,ee)=0; 
run;

I understand that this command is joining the file A to itself 5 times. But I am not able to visualize the output. What does 'index+1' and 'if' stand for. What is the R version for this code?

3

There are 3 answers

0
Yick Leung On

I'm not quite familiar with R, but I know some SAS. I'm not sure if I would call this a macro... The output of your merged data set will depend on how your input data set looks like. Just run your code, and you'll be able to see it in your work folder...

Generally, the data step is structured like an implicit loop. The index+1 looks like the sum statement with the syntax: variable+expression. In this case, the value of index after +1 will be retained for another iteration.

The if statement here contains a boolean condition (i.e. it can have the value of either True or False, but not both) to set a constraint when outputting the data step. If it's true, the current row of data will be outputted. nmiss(var1,var2,var3,...) is a function that will return the number of arguments specified inside nmiss() that are missing. E.g. if only var1 is missing, nmiss(var1,var2,var3,...) = 1.

1
BellevueBob On

As Yick says, the index+1 statement creates a new variable in your output data set that begins with one and increments for each observation processed.

The nmiss(...) function used like this is called a sub-setting IF expression, meaning that observations having a non-zero result (no missing values) are not written out to your final dataset.

The best way to visualize the results will be for you to run this code twice using a small test dataset, once using that if statement and once without. For example:

data a;
  do i=1 to 50;
     date = today() + i;
     output;
     end;
run;

data dates1;
  merge A(keep=date rename=(date=beg))
        A(keep=date firstobs= 5 rename=(date=end))
        A(keep=date firstobs= 10 rename=(date=bega))
        A(keep=date firstobs= 15 rename=(date=ee))
        A(keep=date firstobs= 30 rename=(date=eend));
  index+1;
  if nmiss(beg,end,bega,eend,ee)=0;
  format beg end bega ee eend yymmdd10.;
run;

data dates2;
  merge A(keep=date rename=(date=beg))
        A(keep=date firstobs= 5 rename=(date=end)) 
        A(keep=date firstobs= 10 rename=(date=bega))
        A(keep=date firstobs= 15 rename=(date=ee))
        A(keep=date firstobs= 30 rename=(date=eend));
  index+1;
  format beg end bega ee eend yymmdd10.;
run;

After running the above, open both datasets in SAS and compare them side-by-side. The effect of the subsetting-IF statement should be obvious, as well as probably help you understand why this was done (a clever trick, by the way). I added a FORMAT statement to make it a bit easier to see.

0
DomPazz On

It's been a while since I wrote R (so this might not be the best code), but this would be roughly equivalent to

n = nrow(a)
dates = data.frame(cbind(
    1:(n-29),
    a[1:(n-29),"date"],
    a[5:(n-25),"date"],
    a[10:(n-20),"date"],
    a[15:(n-15),"date"],
    a[30:n,"date"]
))

names(dates) = c("index","beg","end","bega","ee","eend")

As you said, you are merging A onto itself 5 times. As others have said, the index+1 statement simply acts as a row index count. The if nmiss(...)=0; statement means you only get rows where everything lines up.

So use the cbind() function in R to do the merge. cbind() requires that you have like lengths on the inputs so you have to adjust your ranges. These ranges are the equivalent to the firstobs= option on the input Data Set plus the subsetting if ... ; statement.