R sample from unbalanced panel data

Question

Asked by user1491868 on 11 December 2013 at 19:10

I am working with unbalanced panel data from which I would like to draw a random sample that is not biased by the differing number of observations per unit. For example, in the code below, a sampled row is twice as likely to come from IBM as from GOOG and five times as likely to come from IBM as from MSFT. Is there any way to sample these data as if each company had an equal probability of being selected? Possibly by using the sampling package?

df <- data.frame(COMPANY=c(rep('IBM',50),rep('GOOG',25),rep('MSFT',10)), YEAR=c(1961:2010,1988:2012,1996:2005), PROFIT=rnorm(85))
df

df[sample(nrow(df), 20, replace=FALSE), ]

There are 2 answers

overeducatedpoverty On 18 February 2016 at 14:54

I'm just a new R user, but here is my solution:

Load the example data (based on the PSID). The data are an unbalanced panel: 98 individual observations in 15 groups, observed between 1977 and 1983, with a gender identifier (not used here).

df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 5L,5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L,10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 11L,12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 13L,13L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L, 15L,15L, 15L), year = c(1978L, 1979L, 1980L, 1981L, 1982L, 1983L,1977L, 1978L, 1979L, 1980L, 1981L, 1982L, 1983L, 1977L, 1978L,1979L, 1980L, 1981L, 1982L, 1983L, 1979L, 1977L, 1978L, 1979L,1980L, 1981L, 1982L, 1983L, 1977L, 1978L, 1979L, 1980L, 1981L,1982L, 1983L, 1977L, 1978L, 1979L, 1980L, 1981L, 1982L, 1983L,1977L, 1978L, 1979L, 1980L, 1981L, 1982L, 1983L, 1977L, 1978L,1979L, 1980L, 1981L, 1982L, 1983L, 1977L, 1978L, 1979L, 1980L,1981L, 1982L, 1983L, 1977L, 1978L, 1979L, 1980L, 1981L, 1982L,1983L, 1977L, 1978L, 1979L, 1980L, 1981L, 1982L, 1983L, 1977L,1978L, 1979L, 1980L, 1981L, 1982L, 1983L, 1977L, 1978L, 1979L,1980L, 1981L, 1982L, 1983L, 1977L, 1978L, 1979L, 1980L, 1981L,1982L, 1983L), gender = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("id", "year","gender"), row.names = c(NA, 98L), class = "data.frame")

Create a data frame with one row per group id (in this example, there are 15 distinct groups):

library(dplyr)  # provides select(), group_by(), ungroup(), sample_n() and %>%
ids <- select(df, id) %>% group_by(id) %>% sample_n(1)

Draw a random sample of 5 of the 15 groups:

ids <- ungroup(ids) %>% sample_n(5)  # keep the sampled id values unchanged so the merge matches on them

Merge (m:1) the original data frame with the sampled ids, which keeps every observation of the selected groups:

df_new <- merge(x = df, y = ids, by = "id", all.y = TRUE)
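For comparison, the same unit-level draw can be sketched in base R without dplyr. This is only an illustration, assuming that whole groups (every year of a sampled id) are wanted; toy, keep_ids and toy_new are made-up names, and the small data frame merely stands in for the PSID-style example above.

```r
set.seed(1)  # only to make the sketch reproducible

# Toy unbalanced panel: id 1 has 3 rows, id 2 has 1, id 3 has 2, id 4 has 4
toy <- data.frame(id   = c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4),
                  year = c(1977:1979, 1979, 1978:1979, 1977:1980))

# Each distinct id is one candidate, however many rows it contributes
keep_ids <- sample(unique(toy$id), 2)

# Keep every observation that belongs to one of the sampled groups
toy_new <- toy[toy$id %in% keep_ids, ]
toy_new
```

Because the draw is made over unique ids, a group with one observation is as likely to be picked as a group with seven.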

Julius Vainora On 11 December 2013 at 19:35 (accepted answer)

Here is what you could do:

probs <- 1 / table(df$COMPANY)[df$COMPANY]
df[sample(nrow(df), 20, replace = FALSE, prob = probs), ]
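The same weights can also be built without indexing a table, which some readers may find easier to follow. The sketch below is one possible variant, not part of the answer itself: it uses ave() to attach each company's row count to its rows. The names n_rows, w and smp are illustrative, and the seed is only there for reproducibility.

```r
set.seed(42)
df <- data.frame(COMPANY = c(rep("IBM", 50), rep("GOOG", 25), rep("MSFT", 10)),
                 YEAR = c(1961:2010, 1988:2012, 1996:2005),
                 PROFIT = rnorm(85),
                 stringsAsFactors = FALSE)

# For every row, the number of rows its company has (50 for IBM, 25 for GOOG, 10 for MSFT)
n_rows <- ave(seq_len(nrow(df)), df$COMPANY, FUN = length)

# Inverse-count weights: rows of companies with fewer observations weigh more
w <- 1 / n_rows

# Weighted draw of 20 rows, as in the answer
smp <- df[sample(nrow(df), 20, replace = FALSE, prob = w), ]
table(smp$COMPANY)
```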

Let us test it:

table(df[sample(nrow(df), 1e6, replace = TRUE, prob = probs), "COMPANY"])
#   GOOG    IBM   MSFT 
# 333499 333080 333421

Instead of giving every row the same probability, 1/(50+25+10), we weight each row by the inverse of its company's row count, so that every company has the same total probability of being chosen:

tapply(probs, df$COMPANY, sum)
# GOOG  IBM MSFT 
#   1    1    1
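To see what these weights mean for a single draw, they can be rescaled by hand so that they sum to 1. This is just a sketch for illustration; company, w and p are names chosen here, and the vector is rebuilt so that the block runs on its own.

```r
company <- c(rep("IBM", 50), rep("GOOG", 25), rep("MSFT", 10))

# Inverse-count weight per row, as a plain numeric vector
w <- 1 / as.vector(table(company)[company])

# Rescale so that the weights form a probability distribution over rows
p <- w / sum(w)

sum(p)                   # total probability over all 85 rows
tapply(p, company, sum)  # probability mass per company: one third each
p[company == "IBM"][1]   # a single IBM row: (1/50) / 3 = 1/150
```

Each IBM row is individually less likely than each MSFT row, but the 50 IBM rows together carry the same mass as the 10 MSFT rows.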

(probs sums to 3 rather than 1, but sample() rescales the weights internally, so that is not a problem.) To make the math clearer, let us take a simple example (whose weights again do not sum to 1):

vec <- c(1, 1, 2)
as.vector(1 / table(vec)[vec])
# [1] 0.5 0.5 1.0
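Another way to think about a single weighted draw is as a two-stage procedure: first pick a company with equal probability, then pick one of that company's rows uniformly. The sketch below spells this out; two_stage_draw is a hypothetical helper written for this illustration (not a function from any package), and it draws with replacement, so the same row can appear more than once.

```r
set.seed(123)
df <- data.frame(COMPANY = c(rep("IBM", 50), rep("GOOG", 25), rep("MSFT", 10)),
                 YEAR = c(1961:2010, 1988:2012, 1996:2005),
                 PROFIT = rnorm(85),
                 stringsAsFactors = FALSE)

# Hypothetical helper: n independent two-stage draws (with replacement)
two_stage_draw <- function(data, n) {
  companies <- unique(data$COMPANY)
  # Stage 1: choose a company for each draw, all companies equally likely
  picked <- sample(companies, n, replace = TRUE)
  # Stage 2: choose one row index uniformly among the chosen company's rows
  rows <- vapply(picked, function(co) {
    idx <- which(data$COMPANY == co)
    idx[sample.int(length(idx), 1)]
  }, integer(1))
  data[rows, ]
}

smp <- two_stage_draw(df, 3000)
table(smp$COMPANY)  # counts should be roughly equal across companies
```

With replacement this gives each row the same selection probability as the inverse-count weights above, which is why the two approaches produce similar company shares.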
