Categorize one table based on contents of a column in another table

82 views Asked by At

I appreciate some help with this task, which I find a bit challenging I have two tables: 1- Table 1: Containing patients and their diagnosis coded according to ICD code system (the table has several thousand rows in reality)

Patient Diagnosis
JOHN    A1
ALEX    A12
ZAK K12
RICHARD A25
SAM G3
MANNY   H2

2- Table 2: basically contains a number of ICD diagnosis groups, for the sake of simplicity let us say we have three columns A, B, C each of which has rows representing a number of ICD diagnosis codes, here is a simplification:

Group A Group B Group C
A12 A21 C13
A25 A33 J12
H12 G1  A1
K12 K77 L5
G1  J12 A12
D3  A25 G3

What I would like to do is create a third table which is essentially Table 1 plus three more columns each for group A, B and C. The code would then go through every row and see if the ICD diagnosis in table 1 fits any of the diagnoses in Groups A, B and C and if so it would fill the corresponding cell with 1 otherwise 0. Here is the final table just to explain:

Diagnosis   Group A Group B Patient Group C
A1  0   0   JOHN    1
A12 1   0   ALEX    1
K12 1   0   ZAK 0
A25 1   1   RICHARD 0
G3  0   0   SAM 1
H2  0   0   MANNY   0

Could anyone please suggest how this can be done? The tables in reality are of course much much larger

3

There are 3 answers

0
bjaegers On

You could use a PROC SQL statement to accomplish this. I did something like this.

DATA TABLE1;
    INPUT PATIENT $ DIAGNOSIS $;
CARDS;
JOHN A1 
ALEX A12 
ZAK K12 
RICHARD A25 
SAM G3 
MANNY H2
;
RUN;

DATA TABLE2;
    INPUT GROUPA $ GROUPB $ GROUPC $;
CARDS;
A12 A21 C13
A25 A33 J12 
H12 G1 A1 
K12 K77 L5 
G1 J12 A12 
D3 A25 G3
;
RUN;

PROC SQL;
    CREATE TABLE RESULTSET AS
    SELECT 
        A.DIAGNOSIS,
        MAX(CASE WHEN B.GROUPA IS NULL THEN 0 ELSE 1 END) AS GROUPA,
        MAX(CASE WHEN C.GROUPB IS NULL THEN 0 ELSE 1 END) AS GROUPB,
        A.PATIENT,
        MAX(CASE WHEN D.GROUPC IS NULL THEN 0 ELSE 1 END) AS GROUPC
    FROM TABLE1 A
    LEFT JOIN TABLE2 B
        ON A.DIAGNOSIS = B.GROUPA
    LEFT JOIN TABLE2 C
        ON A.DIAGNOSIS = C.GROUPB
    LEFT JOIN TABLE2 D
        ON A.DIAGNOSIS = D.GROUPC
    GROUP BY 
        A.DIAGNOSIS, 
        A.PATIENT;
QUIT;
0
Haikuo Bian On

If resort to Data Step, here is one way (given your RAM is large enough to hold whole ICD codes dictionary), maybe more efficient than Proc SQL.

data have;
    input (Patient Diagnosis) (:$8.);
    cards;
JOHN    A1
ALEX    A12
ZAK K12
RICHARD A25
SAM G3
MANNY   H2
;

data codes;
    input (GroupA GroupB GroupC) (:$8.);
    cards;
A12 A21 C13
A25 A33 J12
H12 G1  A1
K12 K77 L5
G1  J12 A12
D3  A25 G3
;

data want;
    if _n_=1 then
        do;
            declare hash h();
            h.definekey('group','value');
            h.definedone();

            do until (last);
                set codes end=last;
                array grp _character_;

                do over grp;
                    group=vname(grp);
                    value=grp;
                    rc=h.replace();
                end;
            end;
        end;

    set have;
    grpa=not h.check(key:'GroupA', key:diagnosis);
    grpb=not h.check(key:'GroupB', key:diagnosis);
    grpc=not h.check(key:'GroupC', key:diagnosis);
    drop group: value rc;
run;
0
Shenglin Chen On

Efficiency and speed of array are lower than hash table, but it is seemly not limited by memory.

data want;
     set table1;
     if _n_=1 then do;
     do i=1 by 1 until(last);
     set table2 end=last;
     array dig[10000] $ _temporary_;
     array grp groupa--groupc;
     do j=1 to 3;
        n+1;
        dig(n)=grp(j);
     end;
     end;
     end;
     grpA=0;grpB=0;grpC=0;
     do i=1 to n;
     if DIAGNOSIS=dig(i) then do;  
        if mod(i,3)=1 then grpA=1;
        else if mod(i,3)=2  then grpB=1;
        else grpC=1;
     end;
     end;
    drop i j n groupA--GroupC;
run;