Internet Guide to Stata

Complex samples

Data in the social sciences often do not come from a simple random sample; complex sampling designs may be involved instead. Stata's svy commands were developed to deal with such data.

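Again, I start by way of an example. A typical sample in the social sciences may involve primary sampling units (PSUs), that is, clusters such as regions from which the respondents are then drawn, and sampling weights.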

Let's assume that the PSU variable is called "region" and the weights (=inverse probability of being selected) are in variable "weight".

svyset region [pweight=weight]

tells Stata about the primary sampling units and the weights. A number of Stata estimation commands can now be run with the svy: prefix, as in:

svy: regress income educ jobexper firmsize
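Before estimating anything, it can be useful to check what Stata has stored about the design. The following is a minimal sketch, assuming that the design has already been declared with svyset as above and that the variable income exists in the data:

```stata
* Display the survey design settings currently stored with the data
svyset

* Describe the design: number of PSUs and observations per stratum
svydescribe

* A simple design-based estimate: the weighted mean of income
svy: mean income
```

The settings declared with svyset are saved together with the data set, so they have to be declared only once.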

Note 1: Weights can be used in all (or most) statistical procedures simply by adding the weight specification [pw=....] to the command. What svy adds beyond this, and what matters most, is the definition of the primary sampling units.
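To illustrate the difference, here is a hedged sketch that runs the same model twice, reusing the variable names from the example above. The coefficients are the same in both cases; the standard errors will generally differ, because only the svy version takes the clustering within PSUs into account.

```stata
* Weights only: the clustering of observations within regions is ignored
regress income educ jobexper firmsize [pw=weight]

* Weights plus PSUs: the design is declared once, then used via the svy: prefix
svyset region [pweight=weight]
svy: regress income educ jobexper firmsize
```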

Note 2: You can find out which Stata procedures may be used with the svy prefix via help svy estimation.

Note 3: There is a lot more to svy. Two very simple issues: If you have used stratified sampling, strata can be defined; let's take the example used above further and add strata that are defined in variable "stratif":

svyset region [pweight=weight], strata(stratif)

Also, you may apply a correction for finite populations, which can be important if the population from which the sample comes is small. The reason is that statistical estimation in most cases assumes that the sample was drawn with replacement. In fact, virtually all samples in the social sciences are drawn without replacement, but this causes no problem as long as the sample is small in relation to the population from which it was drawn. If you have small populations (say, firms with a few hundred or a few thousand employees), a finite population correction is advisable.
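In Stata, the finite population correction is requested with the fpc() option of svyset. The following is only a sketch: "npsu" is a hypothetical variable name, assumed here to hold the total number of PSUs in the population for each stratum (alternatively, the variable may contain the sampling rate, i.e. values between 0 and 1).

```stata
* npsu is a hypothetical variable: population number of PSUs per stratum
* (it must be constant within each stratum)
svyset region [pweight=weight], strata(stratif) fpc(npsu)

* Subsequent svy: commands then apply the finite population correction
svy: regress income educ jobexper firmsize
```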

Note 4: After having estimated one or several parameters, you can obtain information about design and misspecification effects:

estat effects

yields DEFF and DEFT (that is, the design effect and its square root). To obtain the misspecification effects MEFF and MEFT, use the options , meff meft. This, however, will suppress calculation of DEFF and DEFT, so you have to add the options , deff deft as well to obtain all four values.
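Put together, a request for all four measures after a survey regression might look like this (a sketch reusing the variable names from the example above):

```stata
* Estimate the model with the declared survey design
svy: regress income educ jobexper firmsize

* Report design effects (DEFF, DEFT) and misspecification effects (MEFF, MEFT)
estat effects, deff deft meff meft
```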

Note 5: One could (and probably should) say more about svy concerning more complicated issues. But the preceding commands will help in most situations in which the usual social survey designs have been used.

© W. Ludwig-Mayerhofer, Stata Guide | Last update: 06 Jun 2015