Stratified sampling with multiple variables and probability proportional to size


I have a dataset that looks something like this:

library(tidyverse)
library(TeachingSampling)
data(Lucy)
Lucy <- Lucy %>% select(ID,Level,Zone,Employees,SPAM)
head(Lucy)
     ID Level Zone Employees SPAM
1 AB001 Small    A        41   no
2 AB002 Small    A        19  yes
3 AB003 Small    A        68   no
4 AB004 Small    A        89   no
5 AB005 Small    A        91  yes

I would like to draw a random sample with probability proportional to size, where the size measure is Employees.

I would also like to stratify by two variables: Level and SPAM. I want to use proportional allocation, so that the sample size of each stratum is proportional to the population size of that stratum. I know how to do this with one stratification variable, but not with two:

N <- sum(Lucy$Employees) #pop size
Lucy %>% group_by(Level) %>% 
  summarise(insamp=round(100*sum(Employees)/N)) # calculate number of people sampled in each strata

summary(Level)
 Level  insamp
  <fct>   <dbl>
1 Big         8
2 Medium     39
3 Small      53

# Defines the sample size at each stratum
mh<-c(8,39,53)
# Draws a stratified sample
res <- S.STPPS(Lucy$Level, Lucy$Employees, mh)
# Row indices of the selected units
ind <- res[,1]
# The information about the units in the sample is stored in an object called samp
samp <- Lucy[ind,]
# check
samp %>% group_by(Level) %>% 
  summarise(n=n())
# A tibble: 3 × 2
  Level      n
  <fct>  <int>
1 Big        8
2 Medium    39
3 Small     53

Does anyone know how I can do this when combining several stratification variables, so that the sample is stratified not only by Level but also by SPAM?

In this case my sample should also have 61 observations with SPAM = yes and 39 with SPAM = no:

Lucy %>% group_by(SPAM) %>% 
  summarise(insamp=round(100*sum(Employees)/N))
#  SPAM  insamp
  <fct>  <dbl>
1 no        39
2 yes       61
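
One possible direction, offered as a sketch under assumptions rather than a tested solution: cross the two variables into a single stratum label with interaction(), allocate the sample to each Level x SPAM cell in proportion to its total Employees, and then draw a PPS sample with replacement inside each cell. The code below uses only base R and a made-up toy population (`pop`) with the same columns as Lucy; the helper `allocate()` is hypothetical, and the largest-remainder rounding is my own choice so that the cell sizes add up to exactly 100.

```r
# Toy population with the same columns as Lucy (all values are made up)
set.seed(1)
pop <- data.frame(
  ID        = sprintf("U%03d", 1:300),
  Level     = factor(sample(c("Big", "Medium", "Small"), 300,
                            replace = TRUE, prob = c(0.1, 0.3, 0.6))),
  SPAM      = factor(sample(c("no", "yes"), 300, replace = TRUE)),
  Employees = sample(5:250, 300, replace = TRUE)
)

# Hypothetical helper: proportional allocation of n draws across strata,
# using largest-remainder rounding so the result sums to exactly n
allocate <- function(size, stratum, n) {
  totals <- vapply(split(size, stratum), sum, numeric(1))
  share  <- totals * n / sum(totals)      # exact (fractional) allocation
  mh     <- floor(share)
  short  <- n - sum(mh)                   # draws still to hand out
  if (short > 0) {
    top <- order(share - mh, decreasing = TRUE)[seq_len(short)]
    mh[top] <- mh[top] + 1                # largest remainders get the extras
  }
  mh
}

# Cross the two stratification variables into one stratum label
pop$cell <- interaction(pop$Level, pop$SPAM, drop = TRUE)
mh <- allocate(pop$Employees, pop$cell, n = 100)

# PPS draw with replacement inside each cell, size measure = Employees
rows <- unlist(lapply(names(mh), function(h) {
  idx <- which(pop$cell == h)
  idx[sample.int(length(idx), size = mh[[h]], replace = TRUE,
                 prob = pop$Employees[idx])]
}))
samp <- pop[rows, ]

# Check the cell counts and both margins
table(samp$Level, samp$SPAM)
```

Because each cell's share is proportional to its Employees total, the Level and SPAM margins of the sample should match the single-variable allocations up to rounding. With the real data, the same crossed label and `mh` vector could presumably be passed to S.STPPS(Lucy$cell, Lucy$Employees, mh), provided `mh` is in the order of the factor levels and no cell gets a size of zero; I have not verified that.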