Complex Samples

Data in the social sciences often do not come from a simple random sample. Rather, more complex sampling designs may be involved. Stata's svy commands (svyset and the svy: prefix) were developed to deal with such data.

Again, I start by way of an example. A typical sample in the social sciences may involve:

  • Cases that were not simply drawn by a random procedure from the entire population; rather, a number of higher-level units (regions, schools, firms) were selected randomly first, and individuals were then sampled within these units. The higher-level units are called primary sampling units (PSUs).
  • Cases with different probabilities of being selected. For instance, oversampling is often used to obtain sufficient numbers of cases from small groups. More generally, the probability of being selected is known but is not the same for all cases.

Let's assume that the PSU variable is called "region" and the weights (=inverse probability of being selected) are in variable "weight".

svyset region [pweight=weight]

will inform Stata about the primary sampling units and the weights. Many Stata estimation commands can now be run with the svy: prefix, such as in:

svy: regress income educ jobexper firmsize
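
Before estimating, it may be worth checking that Stata has registered the design as intended. The following is only a minimal sketch; it assumes that a dataset containing the variables named above is in memory:

```stata
* Assumes a dataset with region, weight, income, educ, jobexper and firmsize is in memory
svyset region [pweight=weight]    // declare PSUs and sampling weights
svyset                            // without arguments: display the current survey settings
svydescribe                       // describe the design (strata, PSUs, observations per PSU)
svy: regress income educ jobexper firmsize
```

The settings declared with svyset are stored with the dataset once it is saved, so they need not be declared again in every session.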

Note 1: Probability weights can be used in most estimation commands simply by adding [pw=...] to the command. However, some commands, such as tabulate, do not accept probability weights; tabulate admits frequency weights (plus the lesser-known, and less useful, analytic weights). Here, using svy can be helpful: svy: tabulate uses the weights declared with svyset.
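
For instance, weighted tables might be obtained as follows. This is only a sketch: it assumes that svyset has been run as shown above, and "gender" is a hypothetical variable that does not appear in the example so far.

```stata
* Assumes the design has been declared with svyset; "gender" is a hypothetical variable
svy: tabulate educ                        // one-way table of weighted proportions
svy: tabulate educ gender, column         // two-way table with column proportions
svy: tabulate educ gender, column se ci   // add standard errors and confidence intervals
```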

Note 2: Typing help svy estimation shows which Stata commands may be used with the svy prefix.

Note 3: There is a lot more to svy. Two very simple issues: If you have used stratified sampling, strata can be defined. To take the example used above further, let's add strata that are defined in variable "stratif":

svyset region [pweight=weight], strata(stratif)

Also, you may apply a correction for finite populations, which can be important if the sample makes up a sizeable share of the population from which it comes. The reason is that statistical estimation in most cases assumes that the sample was drawn with replacement. In fact, most likely all samples in the social sciences are drawn without replacement, but this causes no problem as long as the sample is small in relation to the population from which it was drawn. If you have small populations (say, firms with a few hundred or a few thousand employees), a finite population correction is advisable.
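
In Stata, the correction is requested with the fpc() option of svyset. A minimal sketch, continuing the example; "nregions" is a hypothetical variable that I assume holds, for each stratum, the total number of regions (PSUs) in the population:

```stata
* "nregions" is hypothetical: total number of PSUs (regions) in the population, per stratum
svyset region [pweight=weight], strata(stratif) fpc(nregions)
```

The fpc() variable may alternatively contain the sampling rate, i.e., the share of PSUs sampled in each stratum; Stata interprets values less than or equal to 1 as sampling rates.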

Note 4: After having estimated one or several parameters, you can obtain information about design and misspecification effects:

estat effects

yields DEFF and DEFT (that is, the design effect and its square root). To obtain the misspecification effects MEFF and MEFT, add the options meff and meft. This, however, will suppress the display of DEFF and DEFT, so you have to add the options deff and deft as well to obtain all four values.
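
Put together, the sequence might look like this (a sketch that assumes the design has been declared with svyset as above):

```stata
* Assumes svyset has been run and the variables from the example are in memory
svy: regress income educ jobexper firmsize
estat effects                        // default: DEFF and DEFT
estat effects, deff deft meff meft   // all four measures
```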

Note 5: One could (and probably should) say more about svy concerning more complicated issues. But the preceding commands will help in most situations in which the usual social survey designs have been used.

© W. Ludwig-Mayerhofer, Stata Guide | Last update: 23 Apr 2017