I have very little programming experience, but I'm working on a statistics project and would like to generate an unequal probability sample where the inclusion probability of a unit is based on its size (PPS).
Basically, I have two datasets:
ds1 lists US states and the parameter I'm trying to estimate.
ds2 has the population size of each state.
My questions:
I want to use R to select a random sample from the first dataset using inclusion probabilities based on the population of each state (second dataset).
Also, is there any way to use R to calculate these Generalized Unequal Probability Estimator formulas?
A note on the formulas: pi_i is the inclusion probability and pi_ij is the joint inclusion probability.
Yes, that's called weighted sampling. Simply set the weight to the size of the state; strictly, you don't even need to normalize the weights by 1/sum(sizes), although it's always good practice to. There are tons of duplicate posts on SO showing how to do weighted sampling.
The only tiny complication is that you need to do a join() of the datasets ds1 and ds2. Show us what code you've tried if it's causing problems. I recommend using either dplyr or data.table.
Your second question should be asked as a separate question, and is offtopic on SO, or at least won't get a great response - best to ask statistical questions at sister site CrossValidated.
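For the first question, a minimal sketch of the join followed by a weighted draw might look like the following. The column names (state, y, population), the state labels, and all the numbers are made up for illustration, since the question doesn't show the actual data. It uses base R's merge() so it runs without extra packages; dplyr::inner_join(ds1, ds2, by = "state") should do the same job.

```r
# Hypothetical stand-ins for the two datasets; names and values are assumptions.
ds1 <- data.frame(
  state = c("State A", "State B", "State C", "State D", "State E"),
  y     = c(12.1, 10.4, 9.8, 3.2, 2.9)      # made-up values of the parameter
)
ds2 <- data.frame(
  state      = c("State A", "State B", "State C", "State D", "State E"),
  population = c(400, 300, 200, 60, 40)     # made-up sizes
)

# Join the two datasets on the shared state identifier.
ds <- merge(ds1, ds2, by = "state")

# Normalize the sizes into selection probabilities.
# sample() rescales prob internally, so this step is optional but explicit.
ds$p <- ds$population / sum(ds$population)

set.seed(1)   # only so the draw is reproducible
n <- 2        # assumed sample size

# Draw n distinct rows, with larger states more likely to be picked.
idx <- sample(nrow(ds), size = n, replace = FALSE, prob = ds$p)
pps_sample <- ds[idx, ]
pps_sample
```

One caveat: with replace = FALSE, sample() draws units one at a time, so the resulting inclusion probabilities are only roughly proportional to size, not exactly n * p. If exact inclusion probabilities matter for the estimator, a dedicated survey sampling package such as sampling (for example inclusionprobabilities() together with UPsystematic()) is probably a better fit than plain sample().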