Randomizing 1s and 0s by group while specifying the proportion of 1s and 0s within groups


First, I want to create a column that randomizes 1s and 0s by group while keeping the same proportion of 1s and 0s as in another column.

Second, I want to repeat the above procedure many times (say 1000) and calculate the expected value.

Let me clarify with hypothetical data.

library(data.table) 

district <- c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)                                       
village <- c(1,2,3,4,1,2,3,4,5,1,2,3,4,5,6,7)                              
status <- c(1,0,1,0, 1,1,1,0,0,1,1,1,1,0,0,0) 

datei <- data.table(district, village, status) 

I want to create a column that randomizes 1s and 0s within each district while keeping the same proportion of 1s and 0s as in status; the 1:0 ratios are 2:2, 3:2, and 4:3 in districts 1, 2, and 3, respectively.

Second, I also want to repeat this randomization many times (say 1000 times) and calculate the expected value for each row.

I know how to randomize 1s and 0s based on district.

datei[, random_status := sample(c(1,0), .N, replace=TRUE), keyby = district]

However, I do not know how to have the same proportion of 1s and 0s as in status and how to repeat and calculate the expected values for each row.

Many thanks.

Edit: Let me add what I expect for the expected value of each row after, say, 1000 repetitions. The column exp_status is generated by randomizing many times while keeping the 1:0 proportion within each district the same as in status.

district village status exp_status
1 1 1 0.9
1 2 0 0.7
1 3 1 0.8
1 4 0 0.1
2 1 1 0.2
2 2 1 0.3
2 3 1 0.2
2 4 0 0.9
2 5 0 0.8
3 1 1 0.4
3 2 1 0.5
3 3 1 0.9
3 4 1 0.8
3 5 0 0.9
3 6 0 0.8
3 7 0 0.7

There are 2 answers

jay.sf (best answer)

Pass table(status) as prob=. With many rows this gives similar, though not exactly equal, proportions.

set.seed(42)
datei[, random_status := sample(0:1, .N, replace=TRUE, prob=table(status)), keyby = district]

colMeans(datei[, 3:4])
      #  status random_status 
      # 0.56339       0.56277 
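
The overall means above pool all districts. As a rough sketch (not part of the original answer), the same comparison can be made per district. The data are rebuilt here so the block runs on its own; `check` is a name introduced only for illustration, and `by` is used in place of `keyby` for the assignment. Note that `prob = table(status)` assumes every district contains both 0s and 1s.

```r
library(data.table)

# Toy data from the question, blown up to 1e5 rows as in the Data section
datei <- data.table(
  district = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3),
  village  = c(1,2,3,4,1,2,3,4,5,1,2,3,4,5,6,7),
  status   = c(1,0,1,0,1,1,1,0,0,1,1,1,1,0,0,0)
)
set.seed(42)
datei <- datei[sample.int(nrow(datei), 1e5, replace = TRUE)]

# Weighted draw with replacement; table(status) is ordered 0, 1 to match 0:1
datei[, random_status := sample(0:1, .N, replace = TRUE, prob = table(status)),
      by = district]

# Share of 1s per district: observed status vs. random draw
check <- datei[, .(status = mean(status), random_status = mean(random_status)),
               keyby = district]
check
```

Because the draw is made with replacement, the two columns should be close within each district but will generally not match exactly.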

Data:

(slightly blown up, to 1e5 rows)

datei <- structure(list(district = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 
3, 3, 3, 3, 3), village = c(1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 
3, 4, 5, 6, 7), status = c(1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 
1, 0, 0, 0)), row.names = c(NA, -16L), class = c("data.table", 
"data.frame"))

set.seed(42)
datei <- datei[sample.int(nrow(datei), 1e5, replace=TRUE), ]
Maël

The default behavior of sample() is exactly what you are looking for: it reshuffles its input, so the counts of 1s and 0s within each district stay the same:

library(dplyr)  # the .by argument requires dplyr >= 1.1.0
datei |> 
  mutate(random_status = sample(status), .by = district)

#or
library(data.table)
datei[, random_status := sample(status), district]

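As for the second question, I agree with @Paul Stafford Allen's comment: by the law of large numbers, the average over many repetitions converges to a fixed value. When status is reshuffled within districts, that value is the share of 1s in the row's district (0.5, 0.6, and 4/7 ≈ 0.571 for districts 1, 2, and 3), so it equals .5 only when a district has as many 1s as 0s.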
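
To check the second part empirically, here is a minimal sketch that repeats the within-district shuffle and averages the result per row. It assumes the toy data from the question; `n_rep`, `sims`, and `district_share` are names introduced only for this illustration, and base R's ave() is used to apply sample() per district while keeping the original row order.

```r
library(data.table)

# Toy data from the question
datei <- data.table(
  district = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3),
  village  = c(1,2,3,4,1,2,3,4,5,1,2,3,4,5,6,7),
  status   = c(1,0,1,0,1,1,1,0,0,1,1,1,1,0,0,0)
)

set.seed(42)
n_rep <- 1000  # number of repetitions, as suggested in the question

# Each column of sims is one within-district reshuffle of status
# (a 16 x n_rep matrix, rows in the same order as datei)
sims <- replicate(n_rep, ave(datei$status, datei$district, FUN = sample))

# Empirical expected value per row: average over all repetitions
datei[, exp_status := rowMeans(sims)]

# Reference value: share of 1s in the row's district
datei[, district_share := mean(status), by = district]
datei[]
```

With enough repetitions, exp_status should settle near district_share for every row, since each village in a district is equally likely to receive any of that district's status values.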