Recode Variables: Command recode

If you wish to change the categories of a variable, you may employ the command recode. Normally, the recoded variable is not supposed to replace the original variable; rather, you will add the variable with the recoded values to the data set under a different name.

Basics

Here's a simple example. Let's assume that a variable with eight categories is to be simplified by merging some of the categories:

recode industry (1 2 = 1) (3 4 5 = 2) (6/8 = 3), gen(industry_3)

Here, 6/8 means "6 through 8"; the boundaries (i.e. 6 and 8) are included. What follows after the comma causes Stata to store the result in a new variable named "industry_3". If you are sure that you want the recoded values to retain the original name of the variable, you may omit the gen option; but the original values of the variable will be lost in this case. (Of course, you will have stored a copy of the dataset elsewhere, so this is not a serious problem.)
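A cross-tabulation of the old against the new variable is a common way to check that a recode did what you intended. The following is a minimal sketch on made-up data; the eight toy observations are an assumption for illustration only, not part of any real data set.

```stata
* Toy data: one observation for each of the eight categories
clear
input industry
1
2
3
4
5
6
7
8
end

* The recode from the example above
recode industry (1 2 = 1) (3 4 5 = 2) (6/8 = 3), gen(industry_3)

* Rows show the original categories, columns the merged ones
tabulate industry industry_3
```

Each row of the table should have its single observation in the column of the merged category you expect.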

In the simple example presented here, the parentheses around the clauses (e.g., 1 2 = 1) are optional. They are required, however, when several variables are recoded with a single command or if the recoded values are labeled (see below).

If several variables are to be recoded in the same way, you don't have to write several recode commands; just list all of these variables after recode. If you want to store the recoded variables as new variables, you may add the appropriate number of (new) variable names within the parentheses following gen.
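As a sketch of recoding several variables at once, assume two hypothetical variables, ind_father and ind_mother, that share the eight-category coding used above. The variable names and the toy data are invented for this illustration.

```stata
* Toy data: two hypothetical variables coded 1 to 8
clear
input ind_father ind_mother
1 8
3 2
5 6
7 4
end

* Same rules for both variables; gen() takes one new name per variable,
* in the same order as the variables listed after recode
recode ind_father ind_mother (1 2 = 1) (3 4 5 = 2) (6/8 = 3), ///
    gen(ind_father_3 ind_mother_3)

list
```

Note that the parentheses around the rules are required here, because more than one variable is recoded.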

Two further things to be noted:

1. Recoding is often useful to group (or classify) the values of a variable with many values, such as age or income. You might expect that it is tedious to define the boundaries of the classes in such a way that they do not overlap, but actually this is not necessary. You may write, for instance:

recode income (0/1000 = 1) (1000/2000 = 2) (2000/max = 3), gen(inc_gr)

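Stata will place all respondents with an income of 1000 in the first category, and all those whose income exceeds 1000 (whether it's 1000.1 or 1001 or 1500) and is less than or equal to 2000 in the second category, and so on. This works because, where rules overlap, the first rule that matches a value is the one that is applied.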

2. Some keywords may appear on the left of the equals sign to make things easier:

min            the minimum value, whatever it is
max            the maximum value (missing values are not considered here!)
else or *      all other values
miss           all missing values not changed by another rule
nonmiss        all non-missing values not changed by another rule
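Both points can be tried out on a small made-up data set. The sketch below assumes seven invented income values (one of them missing) and two hypothetical names for the new variables; it writes the keyword for missing values in its unabbreviated form, missing.

```stata
* Toy data: values on and around the class boundaries, plus one missing
clear
input income
0
1000
1000.1
1500
2000
2500
.
end

* Overlapping boundaries: 1000 ends up in class 1, 2000 in class 2;
* the missing value matches no rule and therefore stays missing
recode income (0/1000 = 1) (1000/2000 = 2) (2000/max = 3), gen(inc_gr)

* Keywords: min and max as range ends, missing for the missing value
recode income (min/1000 = 1) (1000/2000 = 2) (2000/max = 3) ///
    (missing = 9), gen(inc_gr2)

list
```

In the listing, inc_gr and inc_gr2 should agree everywhere except in the last observation, where only inc_gr2 has been set to 9.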

Labels

When generating a new variable, the new values can be labelled in the process of recoding. It goes like this:

recode industry (1 2 = 1 "Primary sector") (3 4 5 = 2 "Secondary sector") ///
   (6/8 = 3 "Tertiary sector"), gen(industry_3)

The double quotation marks around the labels are necessary only if a label contains blanks; but I can see no reason why you should not always use them.
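To see what the labelled recode produces, you can list the value labels and tabulate the new variable afterwards. This is a minimal sketch on four invented observations; label list without a name simply shows every value label currently in memory, so no assumption about the name of the new label is needed.

```stata
* Toy data: a few of the eight industry categories
clear
input industry
1
4
6
8
end

recode industry (1 2 = 1 "Primary sector") (3 4 5 = 2 "Secondary sector") ///
    (6/8 = 3 "Tertiary sector"), gen(industry_3)

* Show the value label(s) in memory, then the labelled frequencies
label list
tabulate industry_3
```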

Recode with "if" or "in"

Recode may be used together with an if and/or an in clause. In this case, if the gen option is used, the new variable will be set to missing for all observations that do not meet the condition(s) specified. However, the original values will be carried over for these observations if the option copyrest is added.
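The difference made by copyrest is easiest to see side by side. The following sketch assumes a hypothetical indicator variable female and invented data; ind_a and ind_b are likewise made-up names.

```stata
* Toy data: industry (1 to 8) and a hypothetical 0/1 indicator
clear
input industry female
1 1
4 1
7 1
2 0
5 0
8 0
end

* Without copyrest: observations with female == 0 are missing in ind_a
recode industry (1 2 = 1) (3 4 5 = 2) (6/8 = 3) if female == 1, gen(ind_a)

* With copyrest: observations with female == 0 keep their original values
recode industry (1 2 = 1) (3 4 5 = 2) (6/8 = 3) if female == 1, ///
    gen(ind_b) copyrest

list
```

For the first three observations both new variables contain the recoded values; for the last three, ind_a is missing whereas ind_b still shows 2, 5 and 8.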

© W. Ludwig-Mayerhofer, Stata Guide | Last update: 31 May 2018