how to display the total count of individual words from a list

820 views Asked by At

In the data for 10000 item_ids, the item description is given so how to count the frequency of individual word in the item description column, for a particular item_id, where the item_id are repeating, using SAS (without using array).

Goal is to identify the keywords for a particular item_id.

2

There are 2 answers

4
Longfish On

Arrays are used to read across multiple columns, so aren't of any particular use here. This does sound a bit like a homework question and you should really show some attempt that you've made. However, this is not an easy problem to solve, so I will post a solution.

My thoughts on how to approach this are :

  1. Sort the data by item_id
  2. For every item_id, scan through each word and check if it already exists for that item_id. If so then go to the next word, otherwise add the word to the unique list and increment the counter by 1
  3. When the last of the current item_id's is processed, output the unique word list and count

I've hopefully commented the code below sufficiently for you to follow what's going on, if not then look up the particular function or statement online.

/* create dummy dataset */
data have;
input item_id item_desc $30.;
datalines;
1   this is one
1   this is two
2   how many words are here
2   not many
3   random selection
;
run;

/* sort dataset if necessary */
proc sort data=have;
by item_id;
run;

/* extract unique words from description */
data want;
set have;
by item_id;
retain unique_words unique_count; /* retain value from previous row */
length unique_words $200; /* set length for unique word list */
if first.item_id then do; /* reset unique word list and count when item_id changes */
    call missing(unique_words);
    unique_count = 0;
    end;
do i = 1 by 1 while(scan(item_desc,i) ne ''); /* scan each word in description until the end */
    if indexw(unique_words,scan(item_desc,i),'|') > 0 then continue; /* check if word already exists in unique list, if so then go to next word */
    else do;
        call catx('|',unique_words,scan(item_desc,i)); /* add to list of unique words, separated by | */
        unique_count+1; /* count number of unique words */
        end;
end;
drop item_desc i; /* drop unwanted columns */
if last.item_id then output; /* output id, unique word list and count when last id */
run;
0
Haikuo Bian On

Following approach leverage Proc Freq to get 'keyword' distribution.

data have;
infile cards truncover;
input id var $ 100.;
cards;
1 This is test test
2 failed
1 be test
2 failed is
3 success
3 success ok
;

/*This is to break down the description into single word*/
data want;
set have;
do _n_=1 to countw(var);
new_var=scan(var,_n_);
output;
end;
run;

/*This is to give you words freq by id*/

ods output list=mylist (keep=id new_var frequency);
PROC FREQ DATA = want
    ORDER=FREQ
;
    TABLES id * new_var /
        NOCOL
        NOPERCENT
        NOCUM
        SCORES=TABLE
        LIST
        ALPHA=0.05;
RUN; QUIT;
ods _all_ close;
ods listing;