Is it possible to estimate from survey data at the cluster level?

When estimating from survey data that involve clustering, using the survey package in R, is it possible to produce estimates at the cluster level? For example, take the following survey design:

library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

This example is reproduced from the survey package. Here, dnum is the district and fpc is the number of districts in the population. In this case, can we create a subset at the district level? For example, to estimate total enrollment for the district with code 637:

sub1 <- subset(dclus1, dnum == 637)
svytotal(~enroll, sub1)

I got the following output:

        total     SE
enroll 205824 203774

I do not know whether this is the correct method. Any help would be greatly appreciated.

There are 2 answers

Anthony Damico

i think it depends - and you might find that survey statisticians will disagree about whether you can do this in specific cases, but most would probably admit that, at least, you need to consider what it means for the data that you have before you can conclude your analysis is defensible.

consider how the sample was drawn and how many observations there were within the cluster. most complex sample surveys are not simple random samples, so neither the clusters nor the strata are necessarily representative as individual pieces. the survey design was built to give a representative sample in aggregate, not at the level of the sampling cluster.
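As a rough check of how much data sits behind each cluster, one could tabulate the number of sampled schools per district before subsetting. This is only a sketch using the apiclus1 data from the question; the name n_per_cluster and the cutoff of 5 are made up for illustration and are not a recommended rule.

```r
library(survey)
data(api)

# Number of sampled schools (rows) in each district (cluster) of apiclus1
n_per_cluster <- table(apiclus1$dnum)
print(n_per_cluster)

# Districts with fewer than 5 sampled schools (arbitrary, illustrative cutoff)
print(n_per_cluster[n_per_cluster < 5])
```

A district that shows up with only a handful of schools gives very little information to base a district-level estimate on, whatever the software reports.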

as one example, for the consumer expenditure survey, the bureau of labor statistics does not consider analyses using the region variable to be acceptable (region is correlated with their sampling design).

it's possible that a cluster could contain only under-represented groups within some small village. that is an extreme example, but i'd recommend that you proceed with caution when subsetting your microdata using the design variables.

Jan van der Laan

Yes, you can use subset. From the documentation (see `?subset.survey.design`):

Restrict a survey design to a subpopulation, keeping the original design information about number of clusters, strata. If the design has no post-stratification or calibration data the subset will use proportionately less memory.

You can also use

svyby(~enroll, ~dnum, design = dclus1, svytotal)

to calculate your statistics for all clusters.
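To see how the two approaches relate, here is a minimal sketch that computes the total for district 637 both ways and compares the point estimates. It reuses the design from the question; the names est_subset, by_cluster and est_by are only illustrative.

```r
library(survey)
data(api)

dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Estimate for one district via subset()
sub1 <- subset(dclus1, dnum == 637)
est_subset <- svytotal(~enroll, sub1)

# Estimates for every district via svyby(), then pick out district 637
by_cluster <- svyby(~enroll, ~dnum, design = dclus1, svytotal)
est_by <- by_cluster[by_cluster$dnum == 637, ]

print(est_subset)
print(est_by)

# Check whether the two point estimates agree
print(all.equal(unname(coef(est_subset)), est_by$enroll))
```

In both cases the standard error is computed with the original design information kept, so the district is treated as a subpopulation of the full sample and not as a separate survey.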