SAS how to get random selection by group randomly split into multiple groups

I have a simple data set of customers (about 40,000k). It looks like:

customerid, group, other_variable
a,blue,y
b,blue,x
c,blue,z
d,green,y
e,green,d
f,green,r
g,green,e

I want to randomly select Y customers for each group (along with their other variable(s)). The catch is that I want two random selections of Y customers for each group, i.e.:

4000 random green customers split into two sets of 2000 randomly
and 4000 random blue customers split into two sets of 2000 randomly

This is because I have different messages to give to the two different splits. I'm not sampling with replacement; the customers need to be unique.

I would prefer a solution in PROC SQL, but I'm happy with an alternative solution in SAS if PROC SQL isn't ideal.

There are 3 answers

0
Longfish On BEST ANSWER

proc surveyselect is the general tool of choice for random sampling in SAS. The code is very simple: I would just sample 4000 from each group, then assign a new subgroup every 2000 rows, since the data is in a random order anyway (although sorted by group).

The default sampling method for proc surveyselect is srs, which is simple random sampling without replacement, exactly what is required here.

Here's some example code.

/* create dummy dataset */
data have;
do customerid = 1 to 10000;
length group other_variable $8;
if rand('uniform')<0.5 then group = 'blue'; /* assign blue or green with equal likelihood */
    else group = 'green';
other_variable = byte(97+(floor((1+122-97)*rand('uniform')))); /* random letter between a and z */
output;
end;
run;

/* dataset must be sorted by group variable */
proc sort data=have;
by group;
run;

/* extract random sample of 4000 from each group */
proc surveyselect data=have
                    out=want
                    n=4000
                    seed=12345; /* specify seed to enable results to be reproduced */
strata group; /* set grouping variable */
run;

/* assign a new subgroup for every 2000 rows */
data want;
set want;
sub=int((_n_-1)/2000)+1;
run;
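
As a quick sanity check on the result, a cross-tabulation of group by subgroup should show 2000 rows in each cell. This is only a sketch and assumes the want dataset and sub variable created above; with the dummy data, sub should take the values 1 and 2 for blue and 3 and 4 for green, because the rows are sorted by group.

```sas
/* count rows per group and subgroup; each non-empty cell should hold 2000 */
proc freq data=want;
tables group*sub / norow nocol nopercent;
run;
```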

1
yukclam9 On

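/* attach a random sort key to every observation */
data custgroup;
set sorted_data;
point = ranuni(0); /* seed 0 uses the system clock; use a positive seed to reproduce */
run;

/* shuffle the rows within each group */
proc sort data=custgroup out=sortedcust;
by group point;
run;

/* number the rows within each group: 1, 2, 3, ... */
data final;
set sortedcust;
by group point;
if first.group then i=0;
i+1;
run;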

Basically, I first assign a random number to every observation in the data set, then sort by the variables group and point.

This gives a random sequence of observations within each group. The counter i (reset at the first row of each group, then incremented) identifies the row number within the group, so no observation can be extracted twice. Use the output statement as well to control where each observation is stored, based on i.

My approach may not be the most efficient one.
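
The output step mentioned above might look something like the following. It is a minimal sketch, assuming that i counts from 1 within each group in the final dataset; set_a and set_b are hypothetical dataset names, and 2000 is the split size from the question.

```sas
/* route each group's shuffled rows into two non-overlapping sets */
data set_a set_b; /* hypothetical output dataset names */
set final;
if i <= 2000 then output set_a;      /* first 2000 shuffled rows of each group */
else if i <= 4000 then output set_b; /* next 2000 shuffled rows of each group */
/* rows with i > 4000 are not written to either dataset */
run;
```

Because each row has exactly one value of i within its group, a customer can land in at most one of the two sets.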

0
Oliver On

The code below should do it. First, you will need to generate a random number. As Joe said above, it is better to seed it with a specific number so that you can reproduce the sample if necessary. Then you can use PROC SQL with the outobs= option to generate a sample.

(BTW, it would be a good idea not to name a variable 'group'.)

data YourDataSet;
set YourDataSet;
myrandomnumber = ranuni(123);
run;

proc sql outobs=2000;
create table bluesample as
select *
from YourDataSet
where group eq 'blue'
order by myrandomnumber;
quit;

proc sql outobs=2000;
create table greensample as
select *
from YourDataSet
where group eq 'green'
order by myrandomnumber;
quit;
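
The two queries above draw one set of 2000 per group, whereas the question asks for two. One possible extension, sketched here for the blue group only, is to take 4000 rows and split them in a data step. The names blue_a and blue_b are hypothetical, and the sketch assumes myrandomnumber has already been added as shown above.

```sas
/* take the 4000 rows with the smallest random numbers for one group */
proc sql outobs=4000;
create table bluesample as
select *
from YourDataSet
where group eq 'blue'
order by myrandomnumber;
quit;

/* split the 4000 rows into two sets of 2000 */
data blue_a blue_b; /* hypothetical output dataset names */
set bluesample;
if _n_ <= 2000 then output blue_a;
else output blue_b;
run;
```

Since bluesample is ordered by the random number, the first 2000 rows and the next 2000 rows form two disjoint random sets. The same pattern would apply to the green group.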