I am working with unbalanced panel data from which I would like to draw a random sample that is unbiased by the differing number of observations per unit. For example, in the code below, IBM is two times more likely to be selected than GOOG and five times more likely to be selected than MSFT. Is there any way to sample this data as if each company/year has an equal probability of being selected? Possibly by using the sampling package?
df <- data.frame(COMPANY=c(rep('IBM',50),rep('GOOG',25),rep('MSFT',10)), YEAR=c(1961:2010,1988:2012,1996:2005), PROFIT=rnorm(85))
df
df[sample(nrow(df), 20, replace=FALSE), ]
Here is what you could do:
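A minimal sketch of the idea, assuming each row is weighted by the inverse of its company's row count and the weights are passed to the `prob` argument of `sample`. The names `n_per_company`, `probs` and `smpl`, and the seed, are only illustrative:

```r
set.seed(1)  # illustrative seed, only for reproducibility

df <- data.frame(COMPANY = c(rep('IBM', 50), rep('GOOG', 25), rep('MSFT', 10)),
                 YEAR    = c(1961:2010, 1988:2012, 1996:2005),
                 PROFIT  = rnorm(85))

# Number of rows belonging to each row's company (50, 25 or 10),
# returned in the same order as the rows of df
n_per_company <- ave(seq_len(nrow(df)), df$COMPANY, FUN = length)

# Weight each row by 1 / (rows of its company); sample() rescales the weights
probs <- 1 / n_per_company

smpl <- df[sample(nrow(df), 20, replace = FALSE, prob = probs), ]
```

Note that with `replace = FALSE` the weights are applied draw by draw to the rows that remain, so the companies are exactly equally likely only on the first draw and approximately so afterwards.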
Let us test it:
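One way to check this, sketched here with a large number of draws with replacement so that the weights apply exactly on every draw (the vector `company`, the seed and the number of draws are assumptions made for the illustration):

```r
set.seed(42)  # illustrative seed

company <- c(rep('IBM', 50), rep('GOOG', 25), rep('MSFT', 10))

# Inverse-frequency weight for every row
probs <- 1 / as.numeric(table(company)[company])

# Draw many row indices and tabulate which company each one belongs to
draws <- sample(length(company), 30000, replace = TRUE, prob = probs)
freq  <- prop.table(table(company[draws]))

print(round(freq, 2))
```

Each company's share should come out close to one third, even though IBM has five times as many rows as MSFT.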
Instead of giving every row a probability of 1/(50+25+10), we normalised the probabilities so that every company has an equal probability of being chosen
(probs sums to 3 instead of 1, but sample takes care of that). To make the math clearer let us take a simple example (which again does not sum to 1, but that is not a problem):
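For instance, take a hypothetical toy case with two groups, A with two rows and B with one row (the names `grp`, `w`, `p` and `group_p` are made up for this illustration):

```r
grp <- c('A', 'A', 'B')

# Inverse-frequency weights: 1/2 for each row of A, 1 for the single row of B
w <- c(1/2, 1/2, 1)          # sums to 2, not 1

# This is the rescaling sample() applies to the prob argument
p <- w / sum(w)              # 0.25 0.25 0.50

# Probability that a single draw lands in each group
group_p <- tapply(p, grp, sum)
print(group_p)               # A: 0.5, B: 0.5
```

The two rows of A share the probability mass that B's single row gets on its own, so each group ends up with a one-half chance of being picked.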