Merge and Append
These commands enable you to combine two or more data sets into a single set of data. There are two constellations:
- You may have two (or more) data sets about a number of observations (cases, objects). Think about all the countries in the European Union, or all children in a school. The two data sets ideally contain information about exactly the same cases, but the information in the data sets will be different (if not, they're the same data). For instance, one data set may contain data about the EU countries' employment situation, the other data set may be about taxes and social security.
This case is dealt with using themerge
command. - Contrast this with the situation of several data sets that contain (more or less) the same variables, but refer to different observations (cases, objects). For instance, you may have data about countries' employment situation, with cases distributed over several data sets referring to, say, Europe, Northern Africa, Central Africa, and Northern/Central America.
To unite these data sets, Stata offers theappend
command. This is dealt with at the bottom of this entry.
The syntax for merging has changed as of Stata version 11. The old syntax (described further below) will also work with newer versions; that is, you don't have to rewrite all your older do files. Stata will just tell you that you have used old syntax. The new syntax is much clearer, however.
Merge (Stata Version 11 or higher)
Basics
When merging datasets, you will try to match different information about the same cases, information that for some reason or other is stored in more than one data set (e.g., because one part of the information was collected earlier on and additional information has been obtained later). The challenge, of course, is to merge the datasets in such a way that information about case A in dataset X is matched to the information about the same case in dataset Y, and so on. The prerequisite for accomplishing this is to have, in each of the datasets involved, one or several variables that uniquely identifies or identify each case; such variables are called "key variable". In the following I assume the presence of only one key variable, called "caseid" in my examples.
If both data sets contain one row per case, you just write:
merge 1:1 caseid using name-of-second-dataset
Here, "name-of-second-dataset" (called the "using dataset" by the Stata people) is merged to the data in memory (called the "master dataset"), assuming that each value of variable "caseid" is present only once in each of the data sets. Not each caseid in one dataset has to have an equivalent in the other dataset, but unless there is a certain amount of overlap you will not be inclined to merge data.
In contrast,
merge m:1 caseid using name-of-second-dataset
assumes that for each value of caseid in the using dataset there may be more than one case with this caseid in the master dataset. The information from a case in the using dataset will be merged to each case in the master dataset with the same caseid. This is called a "many-to-one merge". A typical example would be data collected from (many) individuals living in different countries, stored in the master dataset. You could add information about those countries (say, GNP, percentage of expenses on public childrearing, share of female earnings) to the individual data, with values for each country in the using dataset where each country represents a single case. Here, "caseid" would refer to a country identifier whose values will be shared by several or many individuals, but which uniquely identifies each country in the using data set.
Similarly,
merge 1:m caseid using name-of-second-dataset
will perform a "one-to-many merge", assuming that a caseid may occur several times in the using dataset, but only once in the master dataset. This is the same constellation as before, the only difference being that the currently active dataset (the master dataset) contains information about the higher level units (e.g., countries) and the lower level information (e.g., individuals) is found in the "using dataset".
Note finally that
merge m:m caseid using name-of-second-dataset
will perform a "many-to-many merge", but there are few occasions in which this could be useful. Also,
merge 1:1 _n using name-of-second-dataset
will match two data sets "by order", that is, the first case of the first dataset is matched to the first case in the second one, the second case to the second, and so on. This may work in some cases, but as normally each dataset has an ID variable, it's rarely worth risking to create nonsensical data this way.
A final note, just to make sure: In the examples above, I used the variable name "caseid" as a key variable on which data are matched. Of course, in real data the name(s) of the ID-variable(s) will differ widely. Note that there is no way to change the name of the key variable "on the fly"; in other words: The key variable must have the same name in the datasets involved prior to matching.
Finally, it should be stated explicitly that you can perform matches based on more than one key variable, as has been hinted at before.
Retaining only selected variables from the "using" file
If you wish to add only a subset of the variables that can be found in "using" file, you can tell Stata via option keepusing
, as in the following example
merge 1:1 caseid using name-of-second-dataset, keepusing (v17 v22 v25)
Variable "_merge"
With each merge, a variable called _merge
is added to the dataset, indicating the status of each case. Possible values of _merge
in the standard case are:
- _merge = 1: A case that was present in the master dataset only.
- _merge = 2: A case that was present in the using dataset only.
- _merge = 3: A case that was present in both datasets.
Note that the variable _merge
has to be deleted before a new merge is performed; otherwise, an error message will result. However, there are ways to deal with this 'problem':
- option
, generate mvarname
causes stata to name the merging variablemvarname
instead of_merge
- option
, nogenerate
will suppress creation of the merging variable.
Duplicate variables
A variable with the same name may occur in both datasets. Stata's default way to deal with this problem is to keep the values of this variable from the master dataset and to delete the variable from the using dataset.
Again, there are ways to treat this cases differently with the help of options.
- With option
, update
missing values in the master dataset will be replaced by values from the using dataset. - With option
, replace
values in the master dataset will be replaced if there is a valid value in the using dataset. This option must be used together with, update
.
Note that the variable _merge
can have some additional values if updates or replacements occur. See the Stata help for more information.
Other options
-
, sorted
informs Stata that the datasets are already sorted by the matching variable(s). -
, keep match
will cause Stata to keep only matching cases in the resulting dataset, in other words, cases with a value of 3 in variable_merge
.
Some more options are available that cover rather special problems.
Merge (up to Stata Version 10)
Assume you have opened a data set with a number of variables. For the same individuals, a number of different variables is contained in another data set, named "cati_beko". Individuals in both data sets are identified by variable "caseid". Both data sets are to be put together alongside each other so that for each individual all variables from both data sets are present in a single row. In other words, the data sets have to be merged. The command
merge caseid using cati_beko, unique
will accomplish this.
"Using" may be followed by the names of more than one data set. Note that when merge is used more than once, you will have to drop
the variable _merge
that is created during the merging process before executing another merge
.
Options
Above, I have used option unique
which means that the identifying variable refers to a single case both in the data set in memory (also called master data set) and the data set referred to by "using" (called using data set). If this assumption is not true, you may use uniqmaster
or uniqusing
, which indicates that there are unique observations in the master or the using data set only. However, multiple cases are a tricky issue from which I refrain for the moment.
Both data sets need not contain exactly the same cases. Option nokeep
causes Stata to drop observations in the using dataset that do not match. If one or both data sets are not sorted, use option sort
.
There are some other options that refer to technical details of labels, the "merge" variable created in the process of merging, and so on. Please refer to the Stata help for detail.
As far as I know (I never tried), you may match two data sets without any option. In this case, the first case of the master data set will be merged with the first case of the using data set, and so on.
Append
append using dataset2
will append dataset2 to your working file (the master dataset). Actually, more than one data set may be appended within one run of append
. I have found no indications in the handbook as to a limitation of the number of data sets to be appended.
Option generate(newvar)
will create variable "newvar" with values as follows:
- newvar = 0: A case from the first dataset (the master dataset).
- newvar = 1: A case from the second dataset (the first using dataset).
- newvar = 2: A case from the third dataset (the second using dataset).
and so on.
Note that there are some further options, particularly for dealing with the problem of data sets whose structure is not really identical. For instance, the same variable may have a different data type or a different label in two (ore more) data sets. Please refer to the handbook for such cases.
© W. Ludwig-Mayerhofer, Stata Guide | Last update: 8 Feb 2024