Creating Multiply Imputed Data Sets

Note: This section refers to Stata 11 or higher. Here, analysis of multiply imputed data is achieved by commands that start with mi. For data analysis, this command often is a composite prefix (mi ...:) which is followed by a standard Stata command. Before version 11, analysis of such data was possible with the help of ados; the basic commands started with mim. But I won't say anything about this here.

The following commands show a few basic steps in the preparation and creation of multiply imputed data sets. Stata also offers commands to deal with importing data sets that have been imputed outside Stata; to learn more, have a look at help mi import. The analysis of multiply imputed data sets will be dealt with, albeit briefly, in the next entry.

Preparatory steps

As usual, what follows assumes that you have already made up your mind what to do; in other words, you have decided to use a multiple imputation procedure and you also have a basic idea about your imputation model.

Setting your data

Multiply imputed data sets can be stored in different formats, or "styles" in Stata jargon. Not surprisingly to experienced Stata users, the basic difference is between wide and long formats (or styles). In wide format, the imputed data are placed side by side, i.e. horizontally, as additional variables; in long format, the data sets are stacked vertically. Actually, three different long formats are offered:

flong	Each data set contains both the complete cases or rows (i.e. those in which no missing values were imputed) and the cases (or rows) which contain imputed values.
flongsep	As before, but each data set is stored in a different file.
mlong	Here, the data sets created in the imputation step will contain only the cases (rows) with imputed values; only the first (original) data set will also contain the complete cases.

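With flongsep, special caution is necessary: because each imputation lives in a file of its own, the whole set of files should only be copied or deleted with the corresponding mi commands (see help mi copy and help mi erase). In any case, the choice of style might be driven mainly by memory or disk space considerations. For "normal" users (moderate sized data sets, no special tricks) wide or flong might be the most natural formats. So,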

mi set flong

will inform Stata that you wish to create multiply imputed data stored in the "flong" format.

With the help of the mi convert command, the style of the imputed data can be changed whenever you like. Thus, if your data are in wide style but for some reason you have to change them into flong style, just write

mi convert flong
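
To see setting and converting in action, here is a minimal sketch. It assumes the auto data set that ships with Stata (sysuse auto) as a stand-in for your own data; that choice is mine and purely for illustration.

```stata
* Illustration only: auto.dta stands in for your own data
sysuse auto, clear
mi set wide                  // declare the data as mi data, wide style
mi convert flong, clear      // clear: the data in memory have not been saved since mi set
mi query                     // reports the current style and the number of imputations
```

The clear option is needed whenever the data in memory have changed since they were last saved; if you have just saved your data, you can leave it out.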

Analysis of missing patterns

misstable summ var1 var2 var4

will provide information about the number and the type of missing values in the variables listed.

Note that Stata will impute only what are called "soft" missing values, i.e. those denoted by a dot. So, if there are "hard" missing values like ".a" etc. and you wish to have them imputed, you should now recode these values accordingly.
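
One possible way to do this recoding is sketched below. The three-row toy data set is made up for illustration (only the variable names var1, var2 and var4 are taken from the example above). The sketch relies on the fact that Stata sorts the extended missing values .a to .z above the plain dot.

```stata
* Toy data, invented for illustration only
clear
input var1 var2 var4
1  .a 3
.  2  .b
4  5  6
end

* .a, .b, ..., .z are all larger than the plain dot,
* so "> ." picks out exactly the hard missing values
foreach v of varlist var1 var2 var4 {
    replace `v' = . if `v' > .
}

misstable summarize var1 var2 var4
```

If the different missing codes carry information that you want to keep, it may be wise to copy the variables (e.g. with clonevar) before recoding them.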

misstable pattern var1 var2 var4, bypattern

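This will produce an overview of missing value patterns, which is important to find out whether or not the pattern is monotone (i.e., whether the variables can be ordered such that a case with a missing value in one variable also has missing values in all variables that follow). This will be decisive for the appropriate imputation strategy.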

Two further commands, misstable nested and misstable tree, may help you further in assessing the missing pattern, but IMHO misstable pattern will provide all the information that you will typically need.

Registering variables

Before proceeding, you have to give Stata some information about the variables involved. In particular, Stata has to know which variables are going to be imputed, as it will use this information for some later checks to prevent you from inadvertently messing up your data in later steps. For this, you have to use the following command:

mi register imputed var1 var2 var4

Another type of variables to be considered are "passive" variables – variables that are created from imputed variables. Even though registering passive variables is only important if you are using the "wide" format mentioned above, it will never hurt. The easiest way is to use the mi passive prefix, such as in

mi passive: gen agesq = age^2

where it is understood that variable age has been imputed (the process of which will be explained soon).

A third type of variable is the "regular" variable, i.e., any other variable to be used in the later modelling steps. Registering these variables is not necessary, but the Stata people nonetheless recommend it.
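
Put together, the registration step might look like the following sketch. Again, the auto data set that ships with Stata is only my stand-in for your own data; rep78 is the one variable in it that has missing values.

```stata
* Illustration only: auto.dta stands in for your own data
sysuse auto, clear
mi set flong
mi register imputed rep78               // the variable to be imputed
mi register regular mpg weight foreign  // complete variables used in the models
```

Should you register a variable by mistake, mi unregister followed by the variable name will take the registration back.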

Finally,

mi describe

gives some basic information about the imputed variables, the passive variables, and some other things.

And now ... do it!

After all this preparation, you may proceed to the most important and probably most difficult step: Creating data sets with imputed data.

Actually, with the help of Stata the practical difficulties in most cases are minor. What is important is the choice of the proper imputation model, which involves a number of considerations that cannot be mapped out here. But it is safe to surmise that in most cases a chained equation imputation will be required. This is the method of choice if (a) there is more than one variable with missing values, (b) the missing pattern is not monotone, and (c) not all variables with missing values are continuous. This situation is very common in the social sciences.

Here is an example of a somewhat complex chained equations, or ICE, imputation command (note that mi impute chained was introduced with Stata 12):

mi impute chained (regress) income jobexper (ologit) satisfac (truncreg, ll(60) ul(220)) bpressure = i.sex i.ethnic age, add(20) dots

The parts of the command in parentheses refer to the statistical model that is assumed for the variables that follow: regress, a (linear) regression model, is used for variables income and jobexper; ologit, i.e. an ordered logistic regression, is considered the method of choice for satisfac (say, a five or seven or 11 point scale to measure satisfaction with whatever); and finally, truncreg, a truncated regression model (with lower and upper limits for the dependent variable), is applied to variable bpressure. On the right-hand side of the equals sign, three variables can be found that have no missing values, two of them categorical and one of them continuous. Note that variable satisfac will automatically appear as a categorical variable on the right-hand side of the imputation models for the other variables.

The options are important here: add(20) informs Stata that 20 imputed data sets are to be created; dots requests Stata to display a dot for each imputation performed (which lets you judge whether any progress is being made).

As usual, more can be found by typing help mi impute.
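
Two things that I find useful in practice are making the imputations reproducible and having a quick look at the results. The following is only a sketch under my own assumptions: it uses the auto data set that ships with Stata, it knocks out a few values of weight artificially so that there is more than one incomplete variable, and the seed is arbitrary. The option rseed() and the prefix mi xeq are regular parts of Stata's mi suite.

```stata
* Illustration only: auto.dta with some artificially deleted values
sysuse auto, clear
replace weight = . if mod(_n, 10) == 0   // set every 10th value of weight to missing

mi set flong
mi register imputed weight rep78
mi register regular mpg foreign

* rseed() fixes the random-number seed, so that a rerun yields the same imputations
mi impute chained (regress) weight rep78 = mpg i.foreign, add(5) rseed(2024) dots

* Compare the original data (m = 0) with the first and the last imputation
mi xeq 0 1 5: summarize weight rep78
```

For the sake of simplicity, both variables are imputed with regress here; for an ordinal variable such as rep78 you would probably prefer ologit in a real application.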

What if something went wrong during the imputation stage?

Of course, I cannot give you a comprehensive answer; I just want to hint at a few Stata commands.

"Undoing" the imputations

If, for whatever reason, you want to discard the imputed data, the appropriate command is:

mi extract 0

This will leave you with the original data in "unset" form. Note that there actually is a command mi unset, but the Stata people advise normal users against using it.
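
As mi extract replaces the data in memory, it may be a good idea to save the imputed data first. A minimal sketch, once more with the auto data set that ships with Stata as a stand-in; the file name auto_imputed is hypothetical.

```stata
* Illustration only: auto.dta stands in for your own data
sysuse auto, clear
mi set flong
mi register imputed rep78
mi impute regress rep78 mpg weight, add(5) rseed(12345)

save auto_imputed, replace   // hypothetical file name; keeps a copy of the imputed data
mi extract 0, clear          // back to the original data, no longer mi set
describe, short              // the number of observations is that of the original data
```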

Discarding single imputations

To drop (delete) a specified imputation, e.g., imputation no. 17, use

mi set m -= (17)

Instead of a single number, you may also use a number list.
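
A sketch of what this looks like with a number list, once more under my own assumptions (the auto data set that ships with Stata, ten imputations, an arbitrary seed):

```stata
* Illustration only: auto.dta stands in for your own data
sysuse auto, clear
mi set flong
mi register imputed rep78
mi impute regress rep78 mpg weight, add(10) rseed(12345)

mi set m -= (3 7/9)   // drops imputations 3, 7, 8 and 9 in one go
mi query              // M should now be 6
```

As far as I can see, the remaining imputations are renumbered consecutively afterwards. So if you drop imputations in several steps, check the numbering before each further step.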

Note that there is a version of this command with a capital M. This may be used to drop a given number of imputations, as in

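mi set M -= 20

Be careful with this one: it reduces the total number of imputations M by 20, and it does so by discarding the last 20 (i.e. the highest-numbered) imputations.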