Missing Values

Stata uses certain 'values' of variables as indicators of missing values. Up to version 7 there was only one missing value, the dot (also called the system missing value). Since version 8, the codes .a to .z (a dot followed by a letter) are also interpreted as missing values; these are called extended missing values. Missing values are excluded from all statistical analyses by default; some procedures (like frequency tables or crosstabulations) permit inclusion of missing values via options.
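As a small illustration of the last point, the following sketch builds a toy dataset (the variable educ and its values are made up for this example) and tabulates it with and without the missing option of tabulate:

```stata
* Toy data: educ is a hypothetical variable with one system missing
* value (.) and one extended missing value (.a)
clear
input educ
1
2
2
.
.a
end

* Default: cases with missing values are left out of the table
tabulate educ

* With the missing option, . and .a each get their own row
tabulate educ, missing
```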

If a variable contains numeric values that represent missing values, these have to be changed into one of the codes for missing values (unless they are supposed to be included in your analyses). This can be achieved with the help of Stata commands for data transformations, i.e. generate/replace and recode. However, there is a special procedure in Stata that makes dealing with missing values safer.
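For comparison, here is a minimal sketch of the 'manual' route with replace and recode. The toy data and the code 999999 are assumptions chosen to match the examples further down.

```stata
* Toy data: 999999 is assumed to be the code for "no answer"
clear
input income
25000
48000
999999
end

* Variant 1: replace sets the value to the system missing value
replace income = . if income == 999999

* Variant 2 (equivalent, shown as a comment because the value is already gone):
* recode income (999999 = .)

list
```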

Using mvdecode and mvencode for treatment of missing values


mvdecode is used to transform numerical values into missing values. If there is one code for missing values, you may simply write

mvdecode income, mv(999999)

Note that this will not just inform Stata that 999999 is to be treated as a missing value; rather, it will actually change the value 999999 into a missing value, by default the dot. The same happens if you set several numerical values to missing, as in

mvdecode income, mv(999997 999998 999999)

Yet, if there are several codes for missing values, you will typically want to retain the difference between them, i.e. to transform each numerical value to a different missing value. This can be done as follows:

mvdecode income, mv(999997 = .a \ 999998 = .b \ 999999 = .c)

This way, missing values can be changed back to the original (or any other) values if desired by the mvencode command:

mvencode income, mv(.a = 999997 \ .b = 999998 \ .c = 999999)
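The round trip can be tried out on a few made-up cases. This is only a sketch; the data are invented, and the codes are the ones used in the examples above.

```stata
* Toy data with three different codes for missing information
clear
input income
25000
999997
999998
999999
end

* Each numeric code becomes its own extended missing value
mvdecode income, mv(999997 = .a \ 999998 = .b \ 999999 = .c)
list

* ... and back again; this works because none of the three
* numeric codes occurs in the data any more
mvencode income, mv(.a = 999997 \ .b = 999998 \ .c = 999999)
list
```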

The basic advantage of using mvencode over recode is that mvencode will return an error (and will not change anything) if you try to transform a missing value to a numerical value that is already present in your data (and hence most likely is a valid value). (Thanks to Frauke Kreuter for pointing this out to me.)
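The protection can be seen in a small sketch like the following. The data are invented; capture noisily is used here only so that a do-file keeps running after the error message.

```stata
* Toy data: 999997 already occurs as a (presumably valid) value
clear
input income
25000
999997
.a
end

* mvencode refuses, because 999997 is already present in income;
* capture noisily shows the error message without stopping the do-file
capture noisily mvencode income, mv(.a = 999997)

* The data are unchanged: .a is still a missing value
list

* Choosing a code that does not occur in the data works
mvencode income, mv(.a = 999998)
list
```

mvencode also has an override option that switches this check off; it should be used only if you are sure that merging the missing code with an existing value is what you want.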


Note that the missing values are coded internally as values that are higher than any numeric value. The "lowest" missing value is the dot, i.e. ".", followed by .a, .b and so on up to .z. This means that if you want to do something only for cases that do not have a missing value in a given variable (say, income), you may use the expression if income < . to achieve this effect.
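A consequence of this ordering is that a condition such as income > 30000 is also true for missing values. The following sketch with made-up data illustrates this; missing() is a built-in Stata function that is true for the dot as well as for .a to .z.

```stata
* Toy data: two valid values, one system and one extended missing value
clear
input income
25000
48000
.
.a
end

* Cases with a valid value: 2
count if income < .

* Caution: missing values count as "larger than 30000", so this gives 3
count if income > 30000

* Restricting to valid values gives the intended result: 1
count if income > 30000 & income < .

* All missing values, whether . or .a to .z: 2
count if missing(income)
```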

© W. Ludwig-Mayerhofer, Stata Guide | Last update: 30 Mar 2012