how can i create a variable depends on other observations

99 views Asked by At

I have a college data, problem is finding if the same id, same course1 and course2 have next term or not. I have the table as below first 4 variables: id, term, course1 and course2. I am trying to create 5th variable 'nextterm'. Terms are like this: 201010-201020-201030-201110-201120-201130-201210,... so on. So id=21, course1 and course 2 are same MAT 51 for 201010 AND 201020. So 201010 is going to be Yes, 201020 is going to be No.

  id    term    course1    course2                nextterm
  21    201010  MAT        41                        No
  21    201010  MAT        51                        Yes
  21    201020  MAT        51                        No
  21    201020  SPC        13                        No
  29    201130  pos        94                        Yes
  29    201210  pos        94                        No 
2

There are 2 answers

13
Haikuo Bian On BEST ANSWER

Ok, this is the updated answer, it requires some understanding of SQL at certain level, it works if your data volume is not too large. Please note, I have converted the term into number in the incoming dataset, but it can also be done on the fly. If you feel it runs too slow, then I suggest 1. resort your data by decending term within id, 2. DOW or Hash should be sufficient.

    data have;
    input id    term    (course1    course2          ) (:$8.);
    cards;
  21    201010  MAT        41                  
  21    201010  MAT        51                  
  21    201020  MAT        51                  
  21    201020  SPC        13                  
  29    201120  pos        94                  
  29    201130  pos        94                  
;

   PROC SQL;
CREATE TABLE WANT AS
SELECT *, CASE WHEN EXISTS(SELECT * FROM HAVE WHERE ID=A.ID AND 
(INT(TERM/100) = INT(A.TERM/100) AND MOD(TERM,100)=MOD(A.TERM,100)+10 
 OR INT(TERM/100) = INT(A.TERM/100)+1 AND MOD(TERM,100)=MOD(A.TERM,100)-20) 
 AND CATS(A.COURSE1,A.COURSE2) = CATS(COURSE1,COURSE2)) 
                  THEN 'Yes' ELSE 'No' END AS NEXTTERM
FROM HAVE A;
QUIT;

Here is a better solution. It sets up a Hash object in the first DOW to store all of the term, course information within the same ID, then in the second DOW to check if your condition meets. It does not require a sort if all of IDs stay together (clustered). For details, please refer to SAS Hash docs.

data have;
    input (id term course1 course2) (:$8.);
    cards;
21    201010  MAT        41                  
21    201010  MAT        51                  
21    201020  SPC        13 
21    201030  MAT        51 
21    201030  SPC        13 
29    201120  pos        94                  
29    201130  pos        94                  
;
run;



   data want;
if _n_=1 then do;
    dcl hash h();
    h.definekey('term','course1','course2');
    h.definedone();
end;
    do until (last.id);
        set have;
        by id notsorted;
        rc=h.add();
    end;

    length nextterm $3;

    do until (last.id);
        set have;
        by id notsorted;

        if h.check(key:cats(substr(term,1,4),input(substr(term,5),2.)+10),key:course1, key:course2) = 0 or 
            h.check(key:cats(input(substr(term,1,4),4.)+1,input(substr(term,5),2.)-20),key:course1, key:course2) =0 then
            nextterm='Yes';
        else nextterm = 'No';
        output;
    end;
    h.clear();
run;
5
user667489 On

Here's another approach you can use - sort the dataset into reverse order, check to see whether each course carries over between terms, then sort it back into the original order:

data have;
    input id term (course1 course2) (:$8.);
    cards;
21    201010  MAT        41                  
21    201010  MAT        51                  
21    201020  MAT        51                  
21    201020  SPC        13                  
29    201120  pos        94                  
29    201130  pos        94                  
;
run;

proc sort data = have;
    by id course1 course2 descending term;
run;

data want;
    set have;
    by id course1 course2;
    length nextterm $3;
    nextterm = ifc(first.id or first.course2, 'No','Yes');
run;

proc sort data = want;
    by id term course1 course2;
run;