Estimation: Basics

Estimation means drawing conclusions from samples about the underlying population(s). Under favourable circumstances (if the data constitute a simple random sample), the statistics that characterize a sample (say, the mean of a variable, or the proportion of cases with a property of interest) are at the same time the best estimates of the corresponding parameters of the population from which the sample was drawn. Estimation is so strongly intertwined with statistics that where a sample characteristic is not the best estimate of the population parameter (the variance is a case in point), statistical software automatically uses the formula for the population estimate. Stata is no exception. Thus, in many ways every statistical procedure may be considered to yield estimates.

However, there is more to estimation. First, so far we have been talking about point estimation, that is, estimating a parameter by a single value. But as all estimation is uncertain, point estimation should always be accompanied by interval estimation. Second, not all samples are simple random samples, and sophisticated statistical software should be able to deal with the complex sampling procedures that are typical of survey research. Also, several alternative approaches to estimation have been (and still are being) developed to deal with problems that often arise whenever "real world" data are involved. For instance, for some characteristics we do not know enough about their sampling distribution (the distribution we use to compute standard errors), and therefore different procedures for estimation are required, with the bootstrap currently being the best-known example. Or, missing values may impede estimation, as they reduce sample sizes or distort distributions. Multiple imputation is one class of procedures that were developed to deal with this problem.

This entry will look specifically at a few procedures for the estimation of "basic" descriptive parameters, notably means, proportions, totals and ratios. However, much of what will be outlined here (particularly about different methods of estimation and about survey design) can also be used with regression models.

Estimating descriptive parameters: Overview

Means

mean income

will, by default, yield the mean, the standard error and a 95 percent confidence interval for variable "income", based on standard assumptions about simple random sampling. To obtain estimates for different subgroups, use the "over" option, such as in

mean income, over(region)

Finally, you may wish to choose a different confidence level, such as in

mean income, over(region) l(99)

which will induce Stata to compute 99 percent confidence intervals.
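
If you have no suitable data at hand, the three variants can be tried with the auto dataset that ships with Stata. This is only a sketch: the variables price and foreign merely stand in for "income" and "region" and are my choice for illustration.

```stata
* Illustration only: the shipped auto data stand in for the income example
sysuse auto, clear

* Mean, standard error and 95 percent confidence interval
mean price

* Separate estimates for each category of foreign
mean price, over(foreign)

* The same with 99 percent confidence intervals; level() is the full name of l()
mean price, over(foreign) level(99)
```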

Warning: The arithmetic mean is not the appropriate statistic in some specific cases, such as growth rates or velocities. The command ameans will compute, in addition to the arithmetic mean, the harmonic mean and the geometric mean.

ameans speed

will compute the arithmetic, geometric and harmonic mean of variable "speed" plus their confidence intervals. A variable list (instead of a single variable name) may be used with ameans; however, the command cannot be combined with over, so you'll have to use the by prefix or the if clause in case you want to compute one or several means conditional on the values of some other variable(s). Instead of ameans, the commands gmeans or hmeans may be used. Even though the user guide doesn't say so, the command means (not to be mixed up with mean) has the same effect.

Note, however, that these special procedures do not offer the possibilities for complex estimation outlined below.
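
The by prefix and the if clause mentioned above might be used with ameans as follows. Again this is just a sketch based on the shipped auto dataset; the variables price, mpg and foreign are stand-ins chosen for illustration.

```stata
* Illustration only, using the auto data shipped with Stata
sysuse auto, clear

* Arithmetic, geometric and harmonic means for two variables at once
ameans price mpg

* One set of means per category of foreign, via the by prefix
bysort foreign: ameans mpg

* Means for a single subgroup, via the if clause
ameans mpg if foreign == 1
```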

Proportions

A frequency table or a crosstabulation may be used to estimate proportions for simple random samples, as in this case the sample proportions are the best estimates of the underlying population parameters. However, to obtain 95 percent confidence intervals for each proportion, you have to use the "proportion" command, as in

proportion emplstatus

or in

proportion emplstatus, over(gender)

Of course, the , l(#) option for a different confidence level may be used as well (just as in all other procedures that yield confidence intervals).

Note that the confidence intervals are different from what is presented in most elementary textbooks. Above all, they are not symmetric around the estimated proportion (except for a value of exactly .5).
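
The asymmetry is what one would expect if the interval is first computed on the logit scale and then transformed back. As far as I know, this is what recent Stata versions do by default, but please take the following as a sketch under that assumption and not as a statement of the exact formula used by your version. The symbols are mine: \hat{p} is the estimated proportion, \widehat{se} its standard error, and t the critical value for the chosen confidence level.

```latex
\operatorname{logit}(\hat{p}) = \ln\frac{\hat{p}}{1-\hat{p}}, \qquad
[\,l,\ u\,] = \operatorname{logit}(\hat{p}) \pm t \cdot \frac{\widehat{se}}{\hat{p}\,(1-\hat{p})}, \qquad
\mathrm{CI} = \left[\ \frac{e^{l}}{1+e^{l}},\ \frac{e^{u}}{1+e^{u}}\ \right]
```

Because the back-transformation is non-linear, the limits always stay between 0 and 1, and they are equidistant from \hat{p} only when \hat{p} = .5.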

Totals

Let's assume that a binary variable coded "0" and "1" denotes the absence (value 0) or presence (value 1) of some property (e.g., being a non-native, being affected by some illness, having graduated from college). For such variables, procedure total will yield an estimate plus the appropriate confidence interval for the total number of cases in the population with this property. Typically, you will use this command only if your dataset also contains the appropriate weights for such an estimation procedure.

Thus,

total emplstatus [pw=weight]

will yield an estimate of the population total, whereas

total emplstatus [pw=weight], over(gender)

will yield separate estimates by gender.

Actually, what total estimates is the mean of the variable(s) under investigation, multiplied by the (weighted) number of cases. Thus, any values other than "0" or "1" will most likely produce a meaningless result, at least if the total is meant to be read as a number of cases.
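
The relation between the total and the mean can be written out as follows. The notation is mine and only meant as a sketch: y_i is the value of the variable for case i, w_i the weight of that case, and n the number of cases in the sample.

```latex
\hat{T} = \sum_{i=1}^{n} w_i\, y_i = \bar{y}_w \cdot \hat{N},
\qquad \bar{y}_w = \frac{\sum_{i=1}^{n} w_i\, y_i}{\sum_{i=1}^{n} w_i},
\qquad \hat{N} = \sum_{i=1}^{n} w_i
```

If y_i can only be 0 or 1, the weighted mean \bar{y}_w is the weighted proportion of cases with the property, and \hat{T} is the estimated number of such cases in the population.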

Ratios

Estimating population ratios is something for specialists; please see Stata's help at "help ratio".

Estimation beyond simple random sampling

In addition to estimation based on simple random sampling, various possibilities exist to deal with situations in which standard estimation procedures do not apply. Some of these are discussed in the following entries, most notably Stata's survey procedure to deal with complex samples and its features to deal with multiply imputed data. This section presents some further procedures that are available as options for many of Stata's commands (notably for regression models), including those presented above.

Clustered samples

Problems arise when cases were not sampled independently from each other (such as in the cluster sampling procedures that are so typical of much survey research, particularly when face-to-face interviews are used). As cases from a cluster typically share some similarities (after all, they share a common "environment"), the standard estimation procedures do not apply, at least not in the strict sense. Perhaps you have come across the abbreviation "i.i.d.", which is often used in statistical texts. It stands for "independent and identically distributed" and describes the standard assumption that all cases come from the same population (or at least from populations with the same characteristics), but were collected independently from each other. The "i.i.d." assumption is violated in the case of clustered sampling.

Data stemming from cluster sampling procedures should contain a variable that denotes the cluster to which each case belongs (the cluster is often called the "primary sampling unit"). You can then use the "cluster" option for the computation of standard errors as follows (assuming by way of example that the variable denoting clusters is indeed called "psu"):

proportion emplstatus, vce(cluster psu)

"vce" stands for "variance-covariance estimation", with variance referring to the square of the standard error of the estimates.

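To see what the cluster option does, you may compare the results with and without it. The following sketch assumes that Stata's nlswork example data can be downloaded with webuse (this requires an internet connection). In these data, the repeated observations on one person play the role of the cases within a cluster, and idcode stands in for "psu"; both are merely my choice for illustration.

```stata
* Assumption: the nlswork example data can be downloaded with webuse
webuse nlswork, clear

* Standard errors that treat all observations as independent
mean ln_wage

* Standard errors that allow for dependence within each idcode (the cluster)
mean ln_wage, vce(cluster idcode)
```

The point estimate is the same in both cases; only the standard error and the confidence interval differ.
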
Bootstrap and jackknife estimation

Bootstrap and jackknife are estimation procedures that compute the standard errors directly from the sample, without assumptions about the distribution of statistics drawn (repeatedly) from populations. As most statisticians think that bootstrap estimators have better properties, I will restrict this section to bootstrapping.

Bootstrapping means that a number of samples are drawn with replacement from the data; each sample has the same size as the data set, but due to drawing with replacement each (or almost each) sample will be different from all other samples. This way, we can arrive at an idea about how much samples drawn from the population may vary; in fact, the variance of the estimate across the different samples is used to calculate the standard errors. As each sample is as large as the data at hand, the only "variable" in this procedure is the number of samples to be drawn. With modern computers, you need not be modest and may easily draw 100, 200 or even 500 samples, at least if your data set is not extremely large.

An example command (with 500 samples, i.e. repetitions of the drawing) might look like this:

proportion emplstatus, vce(bootstrap, rep(500))
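
As the samples are drawn at random, the standard errors will differ slightly from run to run unless you fix the seed of the random-number generator first. Here is a minimal sketch using the shipped auto dataset; the variable foreign and the seed value 12345 are arbitrary choices of mine.

```stata
* Illustration only, using the auto data shipped with Stata
sysuse auto, clear

* Fix the random-number seed so that the resampling can be reproduced
set seed 12345

* Bootstrap standard errors for a proportion, based on 500 resamples
proportion foreign, vce(bootstrap, reps(500))

* Jackknife standard errors for comparison
proportion foreign, vce(jackknife)
```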

Robust standard errors

Robust standard errors were developed to deal with cases where the available data do not meet the strict requirements that exist for some statistical procedures. For instance, linear regression assumes that the variance of the residuals is the same for all values of the independent variables (the technical term for this is homoscedasticity). Robust standard errors remain valid when this assumption is violated. Thus, as long as the cases in your sample were collected independently from each other, you should be on the safe side when you use robust standard errors.

Accordingly, most of Stata's regression procedures (not, however, the estimation procedures for simple descriptive statistics presented above) permit using the , vce(robust) option; in most cases, you may simply write , robust.
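
A short sketch with a linear regression may illustrate this. It uses the shipped auto dataset, and the model itself (price regressed on weight and mpg) is an arbitrary choice of mine.

```stata
* Illustration only, using the auto data shipped with Stata
sysuse auto, clear

* Conventional standard errors, which assume homoscedasticity
regress price weight mpg

* Robust standard errors
regress price weight mpg, vce(robust)

* Shorthand for the same request
regress price weight mpg, robust
```

The coefficients are identical in all three runs; only the standard errors, test statistics and confidence intervals change.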

Testing

The estimates from most of Stata's procedures may also be used to test assumptions about parameters. For instance, after having estimated two proportions you may wish to test whether they are equal to each other (remember that statistical testing is about drawing inferences from samples to populations).

There is, of course, quite a lot that may be said about testing; many procedures, such as the t-test, analysis of variance, or Pearson's chi-square test for crosstabulations, are part of even the most basic statistical training. Many elementary test procedures are described elsewhere in this guide. Here, I will restrict myself to testing assumptions about the estimates for proportions.

Testing proportions

To understand the command for testing (differences between) proportions, you first have to consider what a table with estimated proportions looks like. Regrettably, this depends to some extent on your data and your value labels. Just look at how the different groups are denoted in the table produced by procedure proportion!

In a crosstabulation, which is the outcome, for instance, of proportion emplstatus, over(gender), the different values of the first variable often are just termed "_prop_1", "_prop_2" and so on. Likewise, the groups formed by the second variable may be denoted by "_subpop_1", "_subpop_2" etc. So, if you want to test whether the proportion in the fifth category of "emplstatus" is the same for both genders, use the following command:

test [_prop_5]_subpop_1 = [_prop_5]_subpop_2
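
If you are unsure about the names Stata has assigned to the estimates, it may help to display them first. The following sketch assumes that the coeflegend display option is available for proportion in your Stata version; the nlsw88 dataset and the variables race and married merely stand in for "emplstatus" and "gender".

```stata
* Illustration only, using the nlsw88 data shipped with Stata
sysuse nlsw88, clear

* Usual table of estimated proportions by subgroup
proportion race, over(married)

* Same command, but showing the names of the stored estimates
* instead of standard errors and confidence intervals
proportion race, over(married) coeflegend
```

The names shown in the legend are the ones to use in the test command, following the pattern shown above.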