Create New, or Modify Existing, Variables: Commands generate/replace and egen

To create new variables (typically from other variables in your data set, plus some arithmetic or logical expressions), or to modify variables that already exist in your data set, Stata provides two versions of basically the same procedures: Command generate is used if a new variable is to be added to the data set, whereas replace, obviously enough, is used to replace an old (= already existing) variable. For beginners, it may be important to bear in mind that the moment a variable is generated, it exists; therefore, even a variable that has just be created via generate, in the next step has to be addressed by replace.

Note that modern statistical software offers a tremendous range of possibilites, and a brief guide like mine cannot do justice to all of them. Still, this entry is rather lengthy.

Introduction and a few examples

For once, let me start with a general formulation of the syntax:

generate newvar = expression

"Expression" can be a mathematical argument. For instance,

gen dist_abs = abs(distance)

will return the absolute value of variable distance, i.e. negative values will be turned into positive ones. But most of the time "expression" will contain mathematical operators, such as in the following example:

gen pcincome = income / nhhmembers

That is, a variable "per capita income" is created by dividing the total income by the number of household members. Of course, operators and other arguments may be combined, and parentheses can be used to clarify the order in which operators are supposed to work.

Command replace works the same way, except that on the left side of the "equals" sign a variable is named that is already present in your data set. Even though this is not typical, this variable may also appear on the right hand side of the "equals" sign, such as in:

replace distance = log(distance)

Whether or not it makes sense to do something like this is up to you. But replace is needed not infrequently if you cannot or do not want to create a new variable in a single rush. For instance, you may first define a new variable with generate and then modify it for a subset of cases (i.e. if certain conditions are fulfilled). In this case, you will have to refer to the variable by replace in all subsequent steps.

The previous sentence shows, even though implicitly, that generate and replace can be combined with an if clause (just as almost any other Stata command can). Consider the following real-life example: In surveys, respondents are often asked about their income. If they do not want to give an exact figure, they are asked whether their income is within a certain income bracket; this often triggers a considerable number of additional, albeit somewhat less exact, answers. Let's assume that the exact answers are stored in variable income_1 and the categorized answers in income_2. Those respondents who have given an exact answer have value 98 in income_2, whereas all who have given a categorized answer have a value less than that (say, from "1" for "0 to 300 Euros" to 15 for "more than 10,000 Euros"). Now, you may wish to combine both types of answer into a single variable:

generate income_combined = income_1
replace income_combined = income_2 if income_2 < 98

By the way: generate may be abbreviated by gen or even g, whereas there is no abbrevation for replace.

More detail about the operators used (i.e., expressions like "/" oder "<") will follow in the next sections; but first a few more general remarks.

Value labels

You may connect a newly created variable to the name of a value label, such as in:

gen leftright:lflabel = 1 if var13 < 3

The value label "lflabel" may or may not already have been defined.

Storage type

It may be important to create a variable of a certain storage type. For instance, if a variable you want to create is to be stored in the "long" format (apt for integers up to 8 digits), this may be specified as follows:

g long income_combined = income_1 + income_2 + income_3

You should use one of the storage types for integer values (byte, integer or long) whenever appropriate, as data types for variables with decimal values (float or double) may cause messy (though tractable) problems.

String variables can have a length of 1 to 244 characters; the format is "stringX" with X being replaced by the number of characters you wish to allocate to the variable.

Mathematical operators and functions

As outlined above, you will often want to use some maths when creating or changing variables, like adding, multiplying and so on. The basic mathematical operators available are as follows:

+		addition
-		subtraction
*		multiplication
/		division
^		power

A large number of functions is available (see help functions for more). Here are some examples:

abs(x)		absolute value of x
exp(x)		antilog of x
int(x) or trunc(x)		truncation to integer value
ln(x), log(x)		natural logarithm of x
round(x)		rounds to the nearest integer of x
round(x,y)		x rounded in units of y (i.e., round(x,.1) rounds to one decimal place)
sqrt(x)		square root of x

runiform()		returns uniformly distributed numbers between 0 and nearly 1
rnormal()		returns numbers that follow a standard normal distribution
rnormal(x,y)		returns numbers that follow a normal distribution with a mean of x and a s.d. of y

Note that functions that refer to several variables (such as the mean, total, or statistical parameters, over a number of variables) are available with the egen command (see below ).

Logical and relational operators

As outlined in the introductory section, often something is to be done "if" something is the case; e.g., if a variable has, or has not, or is larger than, a certain value. Here we are talking about relational operators (is something greater or smaller than, or [not] equal to something else?). Often, such relations may be combined; we wish to do something "if this" is the case "and if" something else is also the case. "And" and "or" are logical operators. A simple example might be

gen lonemother = 1 if gender == 1 & partner == 0 & nchild > 0

(Note that this example may be incomplete; it's just a very brief demonstration). More about "conditional" transformations will follow in the next section but one. But first, let's give an overwiew of the symbols used for comparing variables or numbers (relational operators) and for writing "and" and "or".

==		equal
!=		not equal
>, >=		greater than, greater than or equal
<, <=		less than, less than or equal
!		not
~		same as `!`
&		and
\|		or

Non-mathematical functions

Other functions may be helpful occasionally. Some of these are the following (use "help function" in Stata to learn more):

generate profession = inlist(isco88, 101, 125, 170, 171, 172, 202, 310)

The new variable "profession" will have a value of 1 if (extant) variable "isco88" has any of the values that now follow; otherwise, the value will be zero. Clearly, this is a simple way to avoid a long list of statements of the "if ... or ... or ... or ..." type. Note that you cannot use the expression "170/172" (meaning 170 thru 172) here. (Also note that the values I have used in this example are entirely fictitious; "isco88" is actually a categorization of occupations, but I did not look it up to find out which among these actually may be termed professions.)

generate clerical = inrange(isco88, 400, 490)

The new variable "clerical" will have a value of 1 if (extant) variable "isco88" lies in the rage of 400 to 490; otherwise, the value will be 0.

generate agroup = autocode(age, 4, 25, 70)

This creates variable "agroup" from variable age, grouping age into 4 categories as follows:
The range from 25 to 70 is divided into four equally spaced intervals (25 to 36.25, more than 36.25 to 47.5 etc.)
Each value from variable age is coded with the upper limit of the interval into which it falls in variable "agroup".

Conditional transformations

Introduction

New variables often use the information gathered in other variables by denoting certain combinations of values. That is, the new variable gets "this" value if something is the case, or "that" value if something else is the case, or finally "yet another" value if again something else is the case. I call such transformations "conditional" -- something is supposed to happen on the condition that something else is the case.

A simple case might concern the question whether a person is married (and living with her or his partner), cohabiting (living with a partner without being married), belonging to a "LAT" (living apart together) couple, or single. The pertinent information may be scattered over several variables, e.g., "marital status", named marstat (with categories married/living together = 1, married/separated = 2, divorced = 3, widow/er = 4, single = 5), "having a partner", partner (yes/no), and "living together with partner", livpart (yes/no). So you might create your new variable as follows:

gen hhtype = 1 if marstat == 1
replace hhtype = 2 if marstat > 1 & partner == 1 & livpart == 1
replace hhtype = 3 if marstat > 1 & partner == 1 & livpart == 0

Indicator variables

Often, conditional transformations are used to create indicator variables, that is variables coded "1" to denote the presence of a property and "0" to denote its absence (sometimes, particularly in the context of regression models, such variables are termed "dummy variables"). Continuing the example of marital status and family (or household) type, we might for some reason have to create a variable that just indicates whether or not a person is married. This might be done in a number of ways, but it would be natural to use a conditional transformation here, for instance:

gen married = 1 if marstat == 1 | marstat == 2
replace married = 0 if marstat > 2 & marstat < .

The clause & marstat < . prevents cases with missing values from being treated as valid cases. Note that the first line might also go like this:

gen married = 1 if inlist(marstat, 1, 2)

(see section about non-mathematical functions above), which is helpful especially if you wish to adress more than two values.

There is an even shorter way to create a 0/1 coded variable; however, you should use this only in the absence of missing values (or if you take special precautions). Look at the following example:

gen married = marstat == 1 | marstat == 2

Note that on the right hand of the equals sign you do not indicate a numeric value to be assigned; all that is mentioned is a condition, namely, that a case should have values 1 or 2 in variable marstat. Stata automatically assigns the value "1" if this condition is "true" and the value "0" if it is not. Note, however, that cases with missing values belong to the latter category (they don't have a value of 1 or 2), so they will be lumped together with persons that are not married. Of course, in this simple example this might be easily repaired by adding a line such as, for instance:

replace married = . if marstat >= .

Note that there is a simple way of creating dummy variables related to the tab command. This method is very simple inasmuch as it creates an indicator variable for each and every different value of the respective variable. To further continue our example,

tab marstat, gen(marst)

will create five dummy variables, marst1 to marst5, each representing one of the categories of variable marstat. Note that cases with missing values in the original variable will be assigned missing values for all of the newly created dummy variables.

If "X", do "this", if "Y", do "that"

The thrust of what follows here will perhaps not be clear to beginners, and it may never become important for many people. But for some transformation tasks what follows can be very useful. The example I use requires some knowledge of lags (treated further below), as this is a context in which the function described here may be particularly important.

It is one of Stata's programming functions, which may however invoked in the context of everybody's work of writing do-files. The function is named cond, which obviously has something to do with "condition". It works like this:

gen highinc = cond(income > 5000, 1, 0)

and does the following: If income is greater than 5000, highinc will have a value of 1, otherwise the value will be 0.

Thus far, you may say: So what? The same result can be achieved with a simple if command. It may also be achieved with the "short" version of generating dummy variables outlined in the previous section, the only difference being that you may use cond with other (resulting) values than 1 and 0 (e.g., you might have written gen highinc = cond(income > 5000, -17, 3000) however nonsensical this may seem.)

But the great thing about cond is that it may be used recursively. For instance, you might write:

gen highinc = cond(income > 5000, 2, cond(income > 3000, 1, 0))

which will result in highinc having a value of 2 if income > 5000, a value of 1 if income <= 5000 and > 3000, and a value of 0 otherwise (by the way: if there are missing values in variable income, you should start the sequence of conditions with cond(missing(income), ). You will have noticed that cond(income > 3000) in this example actually means > 3000 and <= 5000, as cases where income is larger than 5000 will never reach the third argument of the main clause. But still, elegant as it may look, all of this may likewise be achieved with a series of ifs.

But now look at the following data. These data describe a series of episodes in the life of a single person, with "begin" as the month in which the episode started and "end" as the month it terminated. Think, for instance, about a person who was asked to describe what happened once she had left school. This person may have held a job in the first month (denoted by "j" in the table), and in the second month she entered college ("c") where she stayed until month 24. But in addition, she has held another job in months 10 to 12 and a further job in months 13 to 16.

ID begin end type
1   1     1   j
1   2    24   c
1  10    12   j
1  13    16   j

Such data occur frequently in life histories, and actually this example is modelled after Germany's Socio-Economic Panel (where 1 actually refers to January 1983). Now, when analyzing these data, I may wish to know whether or not a person holding a job is at the same time in another "state" (e.g., going to college). The first step would be to find out, for each episode in the data, whether there is (at least) one other episode with which it overlaps. So, in case of the 3rd and 4th line, I would like to know that these episodes overlap with the second episode, or, generally speaking: It would like to know whether there is an episode that starts earlier (this is ascertained by the sorting of the data according to "begin") but ends after the beginning, or even the end, of the current episode. For this reason, I want to know for each episode the highest value of "end" that has occurred thus far, i.e., prior to that episode.

The command I use works like this (note that the suffix [_n-1] refers to the previous row in the data set; thus ID[_n-1] is the ID in the row immediately above the row to which a command is applied):

generate maxend = end if ID != ID[_n-1]
replace maxend = cond(end >= maxend[_n-1] & ID == ID[_n-1], end, ///
cond(end < maxend[_n-1] & ID == ID[_n-1], maxend[_n-1], maxend))

and the result is:

ID begin end type maxend
1   1     1   j     1
1   2    24   c    24
1  10    12   j    24
1  13    16   j    24

Finally, note that there is a special version of this command: Using, for instance, cond(var1,3,1) means that the new variable will have a value of 3 if var1 has any other value than zero, and a value of 1 if var1 is exactly zero. So what about missing values? Indeed, in this special case you should use, e.g. cond(var1,3,1,-1) which means that whenever var1 is missing, the new variable will have a value of -1. If you omit this fourth argument, missing values will be coded as 3.

Lags and leads

This heading refers to variables that are created or modified via reference to another row in the data set. Consider the following example: Assume that you have data about couples; each person is represented by a row in the data set and now you may wish to define variables for each person that refer to that person's partner.

Typically the data might sorted, first, by the couple id, and next by gender, let's say with the male partner (sex=1) coming first, followed by the female partner (sex=2). For each person, there is a variable "income", and now you want to create a variable "pincome" that represents the partner's income. You will proceed as follows:

generate pincome = income[_n+1] if sex==1
replace pincome = income[_n-1] if sex==2

The expression [_n+1] means that the value of the variable "income" of the following row (=the male's partner) is to be allocated to the variable "pincome"; [_n-1] will work in analogous ways. You may also use numbers other than 1 if cirumstances demand it.

Extensions

A number of more complex possibilities have been implemented in the "egen" command. Here are a few I found helpful.

egen nkids = anycount(pers1 pers2 pers3 pers4 pers5), value(1)

Assume that there are data for five household members stored in variables "pers1" to "pers5", classified by age, with a value of 1 indicating that it is a child under 14. Now variable "nkids" will yield the number of children among these five household members.

egen v323r = rank(v323)

Variable v323r will contain the rank of each case in v323. In the case of ties, each case will be assigned the mean rank.

egen myindex = rowmean(var15 var17 var18 var20 var23)

This will create, for each case (or row), a variable containing the mean of the variables in the list of variables. Note that if a case has one or several missing missing values, "rowmean" will be computed from those variables that have valid values. With the help of "rowmiss" (see next item) this may be changed.

egen nmiss = rowmiss(x1-x10 var15-var23)

This will store the number of missing values that occur in the variables listed in variable "nmiss". The function rownonmiss is available as well.
Note that if, for whatever reason, a variable is included twice (or even more frequently) in the list (as in rowmiss(x1 x2 x2 x3)) this will not result in an error. Rather, the variable will be indeed counted twice (or more frequently). In the example just presented, a case with a missing value in x2 would have a value of 2 in the pertinent variable.

egen nmiss = rowtotal(x1-x10 var15-var23)

This will compute the total (or sum) of the variables for each case (or row). Note, however, that missing values will be counted as zeros, and therefore this command should be used with very much caution. There is an option , missing, but this will only yield a missing value in the resulting variable if all variables that enter the computation have missing values.

egen incomst = std(income)

will create standardized values of the variable income. You may "standardize" proposing a mean and standard deviation different from the usual values of 0 and 1, respectively. To do so, use the options mean(value) and or std(value).

Combining `egen` with `by`

A number of functions for egen are particularly useful when combined with the by prefix, inasmuch as they refer to all cases, which, in combination with by, means all cases with the same value in the variable that follows the prefix. Thus,

bysort v3: egen mincome = mean(income)

will compute for each case the mean of variable income of the "group" defined by v3 to which the individual case belongs (e.g., mean income of household, district, country ... whatever v3 is standing for). A selection of further functions can be found here:

bysort varX: egen newvar=count(somevar)		The number of cases with valid values in somevar
bysort varX: egen newvar=max(somevar)		The maximum value of somevar
bysort varX: egen newvar=sd(somevar)		The standard deviation of somevar
bysort varX: egen newvar=median(somevar)		The median somevar
bysort varX: egen newvar=total(somevar)		The total of somevar (missing values are treated as 0 in the computation)
bysort varX: egen newvar=total(somevar), missing		The total of somevar (yields a missing value if all values of somevar are missing)

In the same vein, you may use the functions iqr (interquartile range), kurt (kurtosis), mad (median absolute deviation), mdev (mean absolute deviation), min (try and guess ...), skew (skewness).

Finally, let me mention a few more complex functions (complex only inasmuch as they need a little bit more of explanation):

bysort v3: egen pcincome = pc(income)

will compute for each case the percentage of the total of income. E.g., if one case (within v3!) has an income of 500 and a second case has an income of 1000, the first case will have a value of 33.3333 in variable pcincome, etc. You may use pc(income), prop instead, which will yield the same values expressed as proportions (i.e., .333333, etc., in our example).

bysort v3: egen pct30income = pctile(income), p(30)

will compute, and allocate to each case, the value of the 30th percentile of variable income (within v3).

There are one or two more of these functions which I have omitted here.