Stata Simple Randomization in One Line WITHOUT USING SPLITSAMPLE


I'd like to create a simple randomization of a dataset. The goal is to have 500 observations in treatment and 500 in control. This question is about Stata efficiency: I want to do it in one line.

I can do it in one line with imbalanced groups or three lines with perfect balance.

One line:

clear all
set obs 1000

//one line
gen treatment = mod(floor(runiform() * 1000),2)

This is most likely imbalanced.
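To see why, note that each observation is assigned independently with probability 0.5, so the number treated is Binomial(1000, 0.5) and an exact 500/500 split occurs only about 2.5% of the time. A minimal sketch of that check follows; the seed is arbitrary, and `runiform() < 0.5` is used as a simpler stand-in with the same distribution as the `mod(floor(...))` expression above.

```stata
clear all
set obs 1000
set seed 2024                       // arbitrary seed, only for reproducibility

// independent coin flip per observation (stand-in for the mod/floor one-liner)
generate byte treatment = runiform() < 0.5
tabulate treatment                  // group sizes will usually differ from 500/500

// probability that exactly 500 of 1,000 fair coin flips come up 1
display binomialp(1000, 500, 0.5)
```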

Three lines:

gen rand_n =  runiform()
sum (rand_n),d
gen treatment_again =  rand_n <= r(p50)

Clunky and terrible: you can't even bysort in a single line like this!

I want to do this in one line, maybe two.

Why? Because Stata.

There are 2 answers
dimitriy On BEST ANSWER

Since splitsample is precluded (it is slow), there are two options.

First, you can repackage your clunky code into a program on the fly. I am not sure if that counts as a solution in your mind, but it is a good strategy if you have to sample multiple times.

Second, use egenmore (short for extended generate). egenmore is usually where one-line solutions to such problems can be found. You will need to install it with ssc install egenmore as it is a community-contributed command.

Here's an example of all three approaches (splitsample for comparison, the program, and egenmore), each producing balanced groups of 500:

. clear all

. timer clear

. set obs 1000
Number of observations (_N) was 0, now 1,000.

. 
. 
. timer on 1

. splitsample, nsplit(2) gen(treatment)

. timer off 1

. 
. 
. timer on 2

. capture program drop my_ss 

. program define my_ss
  1.         capture drop treatment_again
  2.         tempvar r
  3.         gen `r' =  runiform()
  4.         _pctile `r', percentile(50)
  5.         gen treatment_again = `r' <= r(r1)
  6. end

. 
. my_ss

. timer off 2

. 
. timer on 3

. egen treatment_again2 = rndsub()

. timer off 3

. 
. timer list
   1:      0.17 /        1 =       0.1700
   2:      0.00 /        1 =       0.0010
   3:      0.00 /        1 =       0.0020

. 
. tab1 treatment*

-> tabulation of treatment  

  treatment |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        500       50.00       50.00
          2 |        500       50.00      100.00
------------+-----------------------------------
      Total |      1,000      100.00

-> tabulation of treatment_again  

treatment_a |
       gain |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        500       50.00       50.00
          1 |        500       50.00      100.00
------------+-----------------------------------
      Total |      1,000      100.00

-> tabulation of treatment_again2  

treatment_a |
      gain2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        500       50.00       50.00
          2 |        500       50.00      100.00
------------+-----------------------------------
      Total |      1,000      100.00
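If the program route is taken for repeated sampling, one possible variation is to pass the new variable name as an argument so that each call creates a fresh assignment. This is only a sketch: the program name `balanced_split`, the variable names `treat1` to `treat3`, and the seed are all made up for illustration. Note that `_pctile` returns the requested percentile in `r(r1)`.

```stata
clear all
set obs 1000
set seed 12345                          // arbitrary seed, only for reproducibility

capture program drop balanced_split     // hypothetical program name
program define balanced_split
    // usage: balanced_split newvarname
    args newvar
    tempvar r
    generate double `r' = runiform()
    _pctile `r', percentiles(50)        // median is returned in r(r1)
    generate byte `newvar' = `r' <= r(r1)
end

// draw three independent balanced assignments
forvalues i = 1/3 {
    balanced_split treat`i'
    quietly count if treat`i' == 1
    display "draw `i': " r(N) " treated"
}
```

With an even number of observations and no ties in the random draws, each call should put exactly half of the sample in each group.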

Nick Cox On

The efficiency of doing a calculation in one line has to be compared with the inefficiency when that one-line command is in practice a call to several other commands (and functions). I often agree that egen is convenient and often use it myself. Indeed I have written several egen functions: some now folded into official Stata, some included in egenmore, and yet others.

That said, something like this

clear 
set obs 1000 
set seed 314159 
gen double rnd = runiform()
sort rnd 
gen wanted = 1 + (_n > 500) 

may look long-winded, but it won't be terribly slow. It also guarantees groups of equal size.
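One side effect of the sort is that the data end up in random order. A small variation, sketched below under the assumption that the original order matters, records the order first and restores it afterwards; the `id` variable is added for illustration, and `_N/2` replaces the hard-coded 500 so the same lines work for other sample sizes.

```stata
clear
set obs 1000
set seed 314159
generate long id = _n                     // remember the original order (illustrative)
generate double rnd = runiform()
sort rnd
generate byte wanted = 1 + (_n > _N/2)    // first half -> 1, second half -> 2
sort id                                   // restore the original order
drop rnd
tabulate wanted
```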