Internet Guide to Stata
Note: This section refers to Stata 11 or higher. Here, multiply imputed data are analyzed with commands that start with mi. For data analysis, mi is often used as a composite prefix (mi ...:) followed by a standard Stata command. Before version 11, such data could be analyzed with the help of ados whose basic commands started with mim, but I won't say anything about this here.
The following commands show a few basic steps in the preparation and creation of multiply imputed data sets. Stata also offers commands to deal with importing data sets that have been imputed outside Stata; to learn more, have a look at help mi import. The analysis of multiply imputed data sets will be dealt with, albeit briefly, in the next entry.
As usual, what follows assumes that you have already made up your mind what to do; in other words, you have decided to use a multiple imputation procedure and you also have a basic idea about your imputation model.
Multiply imputed data sets can be stored in different formats, or "styles" in Stata jargon. Not surprisingly to experienced Stata users, the basic difference is between wide and long formats (or styles). In wide format, the data sets are placed side by side, i.e. horizontally; in long format, they are stacked vertically. Actually, three different long formats are offered:
flong | Each data set contains both the complete cases or rows (i.e. those in which no missing values were imputed) and the cases (or rows) which contain imputed values. |
flongsep | As before, but each data set is stored in a different file. |
mlong | Here, the data sets created in the imputation step will contain only the cases (rows) with imputed values; only the first (original) data set will also contain the complete cases. |
With flongsep, special caution is necessary. In any case, the choice might be driven mainly by memory or disk space considerations. For "normal" users (moderately sized data sets, no special tricks) wide or flong might be the most natural formats. So,
mi set flong
will inform Stata that you wish to create multiply imputed data stored in the "flong" format.
With the help of the mi convert command, the style of the imputed data can be changed whenever you like. Thus, if your data are in wide style but for some reason you have to change them into flong style, just write
mi convert flong
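To see which style is currently in effect before and after a conversion, mi query can be used. The following is only a minimal sketch on a made-up toy data set (the variables id and x are hypothetical); the clear option of mi convert is used because the toy data have not been saved to disk.

```stata
* Hypothetical toy data, for illustration only
clear
input id x
1 10
2 .
3 30
end

mi set wide               // declare the data as mi data in wide style
mi query                  // reports the current style and number of imputations
mi convert flong, clear   // switch to flong; clear is needed for unsaved data
mi query                  // should now report style flong
```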
misstable summ var1 var2 var4
will provide information about the number and the type of missing values in the variables listed.
Note that Stata will impute only what are called "soft" missing values, i.e. those denoted by a dot. So, if there are "hard" missing values like ".a" etc. and you wish to have them imputed, you should now recode these values accordingly.
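One possible way of doing this recoding is sketched below. It relies on the fact that extended missing values (.a to .z) sort above the plain dot. The data values are invented for illustration, and the loop assumes that all listed variables are numeric.

```stata
* Hypothetical toy data with soft (.) and hard (.a, .b) missing values
clear
input var1 var2 var4
1  .a 3
.  2  .b
4  5  6
.b .  7
end

* "> ." is true only for extended missing values, so only these are recoded
foreach v of varlist var1 var2 var4 {
    replace `v' = . if `v' > .
}

* All missing values should now be counted as soft missings
misstable summarize var1 var2 var4
```

Of course, this should only be done for values that you really want to have imputed.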
misstable pattern var1 var2 var4, bypattern
This will produce an overview of missing value patterns, which is important for finding out whether or not the pattern is monotone. This will be decisive for the appropriate imputation strategy.
Two further commands, misstable nested and misstable tree may help you further in assessing the missing pattern, but IMHO misstable pattern will provide all the information that you will typically need.
Before proceeding, you have to give Stata some information about the variables involved. In particular, Stata has to know which variables are going to be imputed, as it will use this information for some later checks to prevent you from inadvertently messing up your data in later steps. For this, you have to use the following command:
mi register imputed var1 var2 var4
Another type of variable to be considered is the "passive" variable, i.e. a variable that is created from imputed variables. Even though registering passive variables is only important if you are using the "wide" format mentioned above, it will never hurt. The easiest way is to use the mi passive prefix, as in
mi passive: gen agesq = age^2
where it is understood that variable age has been imputed (the process of which will be explained soon).
A third type of variable is the "regular" variable, i.e., any other variable to be used in the later modelling steps. Registering these variables is not necessary, but the Stata people nonetheless recommend it.
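The two registration steps might look as follows when taken together. This is just a sketch on invented toy data; the variable names income, age and sex are assumptions made for the example.

```stata
* Hypothetical toy data: income and age have missing values, sex is complete
clear
input income age sex
2500 34 0
.    51 1
3100 .  1
1800 28 0
end

mi set flong
mi register imputed income age   // variables whose missing values will be imputed
mi register regular sex          // complete variable used in later models
```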
Finally,
mi describe
gives some basic information about the imputed variables, the passive variables, and some other things.
After all this preparation, you may proceed to the most important and probably most difficult step: Creating data sets with imputed data.
Actually, with the help of Stata the practical difficulties in most cases are minor. What is important is the choice of the proper imputation model, which involves a number of considerations that cannot be mapped out here. But it is safe to surmise that in most cases a chained equation imputation will be required. This is the method of choice if (a) there is more than one variable with missing values, (b) the missing pattern is not monotone, and (c) not all variables with missing values are continuous. This situation is very common in the social sciences.
Here is an example of a somewhat complex chained equations (ICE) imputation command:
mi impute chained (regress) income jobexper (ologit) satisfac (truncreg, ll(60) ul(220)) bpressure = i.sex i.ethnic age, add(20) dots
The parts of the command in parentheses refer to the statistical model that is assumed. regress, a (linear) regression model, is used for variables income and jobexper. ologit, i.e. an ordered logistic regression, is considered the method of choice for satisfac (say, a five or seven or 11 point scale to measure satisfaction with whatever). Finally, truncreg, a truncated regression model (with lower and upper limits for the dependent variable), is applied to variable bpressure. On the right-hand side of the equals sign, three variables can be found that have no missing values, two of them categorical and one of them continuous. Note that variable satisfac will automatically appear as a categorical variable on the right-hand side of the imputation models for the other variables.
The options are important here: add(20) informs Stata that 20 imputed data sets are to be created; dots requests Stata to display a dot for each imputation performed (which lets you judge whether any progress is being made).
As usual, more can be found by typing help mi impute.
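To try out the whole sequence without touching real data, a small simulated example may help. The following is only a sketch under stated assumptions: the data are randomly generated, the variable names merely echo those used above, only regress is used as imputation method, and only five imputations are requested to keep it quick. The rseed() option is added so that the imputations can be reproduced; it is not part of the command shown above. The sketch also assumes a Stata release in which mi impute chained is available.

```stata
* Simulated toy data, for illustration only
clear
set seed 2016
set obs 500
generate age = 20 + floor(50*runiform())
generate byte sex = runiform() < 0.5
generate income = 1000 + 30*age + 200*sex + rnormal(0, 300)
generate jobexper = max(0, age - 20 + rnormal(0, 3))

* Delete some values at random to create a non-monotone missing pattern
replace income = . if runiform() < 0.15
replace jobexper = . if runiform() < 0.10

* Preparation: style and registration
mi set flong
mi register imputed income jobexper
mi register regular age sex

* Imputation: 5 data sets, reproducible via rseed()
mi impute chained (regress) income jobexper = age i.sex, add(5) rseed(12345) dots

mi describe
```

Afterwards, mi describe should report five imputations, and the imputed copies of the data should no longer contain missing values in income and jobexper.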
© W. Ludwig-Mayerhofer, Stata Guide | Last update: 16 Feb 2016