WLM Stata - Merge and Append

Merge and Append

These commands unite two or more data sets. There are two constellations: Either two data sets refer to (more or less) the same observations (cases, objects) but contain different variables. This case is dealt with via the "merge" command. Or two data sets contain (more or less) the same variables, but refer to different observations (cases, objects). To deal with this case, "append" is the command to use. "Append" is dealt with at the bottom of this entry.

The syntax for merging has changed as of Stata version 11. The old syntax for merging (described further below) will also work with newer versions; that is, you don't have to rewrite all your older do files. Stata will just tell you that you have used old syntax. But the new syntax is much clearer.

Merge (Stata Version 11 or higher)

Basics

Typically, you will have one variable that uniquely identifies each case. More than one identification variable may be present and Stata can deal with this case as well; but in the following I assume the presence of only one such variable, called "caseid" in my examples.

If both data sets contain one row per case, you just write:

merge 1:1 caseid using name-of-second-dataset

Here, "name-of-second-dataset" (called the "using dataset" by the Stata people) is merged to the data in memory (called the "master dataset"), assuming that each value of variable "caseid" is present only once in each of the data sets. Not each caseid in one dataset has to have an equivalent in the other dataset, but unless there is a certain amount of overlap you will not be inclined to merge data.

In contrast,

merge 1:m caseid using name-of-second-dataset

assumes that for each value of caseid in the master dataset there may be more than one case with this caseid in the using dataset. The information from a case in the master dataset will be merged to each case in the using dataset with the same caseid. This is called a "one-to-many merge".

Similarly,

merge m:1 caseid using name-of-second-dataset

will perform a "many-to-one merge", assuming that a caseid may occur several times in the master dataset, but only once in the using dataset.

Note finally that

merge m:m caseid using name-of-second-dataset

will perform a "many-to-many merge", but there are few occasions in which this could be useful. Also,

merge 1:1 _n using name-of-second-dataset

will match two data sets "by order", that is, the first case of the first dataset is matched to the first case in the second one, the second case to the second, and so on. This may work in some cases, but as normally each and every dataset has an ID variable, it's rarely worth risking to create nonsensical data.

In the examples above, I used the variable name "caseid" to match cases. Of course, the name of this variable is your choice. Note that you can perform matches based on more than one matching variable.

The variable _merge

With each merge, a variable called _merge is added to the dataset, indicating the status of each case. Possible values of _merge in the standard case are:

_merge = 1: A case that was present in the master dataset only.
_merge = 2: A case that was present in the using dataset only.
_merge = 3: A case that was present in both datasets.

Note that the variable _merge has to be deleted before a new merge is performed; otherwise, an error message will result. However, there are ways to deal with this 'problem':

option , generate mvarname causes stata to name the merging variable mvarname instead of _merge
option , nogenerate will suppress creation of the merging variable.

Duplicate variables

A variable with the same name may occur in both datasets. Stata's default way to deal with this problem is to keep the values of this variable from the master dataset and to delete the variable from the using dataset.

Again, there are ways to treat this cases differently with the help of options.

With option , update missing values in the master dataset will be replaced by values from the using dataset.
With option , replace values in the master dataset will be replaced if there is a valid value in the using dataset. This option must be used together with , update.

Note that the variable _merge can have some additional values if updates or replacements occur. See the Stata help for more information.

Other options

, sorted informs Stata that the datasets are already sorted by the matching variable(s).
, keep match will cause Stata to keep only matching cases in the resulting dataset, in other words, cases with a value of 3 in variable _merge.
, keepusing var17 var100 var101 will tell Stata to retain only variables var17 var100 var101 from the using dataset for the resulting dataset.

Merge (up to Stata Version 10)

Assume you have opened a data set with a number of variables. For the same individuals, a number of different variables is contained in another data set, named "cati_beko". Individuals in both data sets are identified by variable "caseid". Both data sets are to be put together alongside each other so that for each individual all variables from both data sets are present in a single row. In other words, the data sets have to be merged. The command

merge caseid using cati_beko, unique

will accomplish this.

"Using" may be followed by more than one data set. Note that when merge is use used more than one, you will have to drop the variable _merge that is created during the merging process.

Options

Above, I have used option unique which means that the identifying variable refers to a single case both in the data set in memory (also called master data set) and the data set referred to by "using" (called using data set). If this assumption is not true, you may use uniqmaster or uniqusing, which indicates that there unique observations in the master or the using data set only. However, multiple cases are a tricky issue from which I refrain for the moment.

Both data sets need not contain exactly the same cases. Option nokeep causes Stata to drop observations in the using dataset that do not match. If one or both data sets are not sorted, use option sort.

There are some other options that refer to technical details of labels, the "merge" variable created in the process of merging, and so on. Please refer to the Stata help for detail.

As far as I know (I never tried), you may match two data sets without any option. In this case, the first case of the master data set will be merged with the first case of the using data set, and so on.

Append

append using dataset2

will append dataset2 to your working file (the master dataset). Actually, more than one dataset may be appended within one run of append. I have found no indications in the handbook as to a limitation of the number of datasets to be appended.

Option generate(newvar) will create variable "newvar" with values as follows:

newvar = 0: A case from the first dataset (the master dataset).
newvar = 1: A case from the second dataset (the first using dataset).
newvar = 2: A case from the third dataset (the second using dataset).

and so on.

Note that there are some further options, particularly for dealing with the problem of datasets whose structure is not really identical. For instance, the same variable may have a different data type or a different label in two (ore more) datasets. Please refer to the handbook for such cases.

Internet Guide to Stata

Print article