Merge and Append
These commands enable you to unite two or more data sets. There are two constellations: Either two data sets refer to (more or less) the same observations (cases, objects) but contain different variables. This case is dealt with by the "merge" command. Or two data sets contain (more or less) the same variables, but refer to different observations (cases, objects). To deal with this case, "append" is the command to use. "Append" is dealt with at the bottom of this entry.
The syntax for merging has changed as of Stata version 11. The old syntax (described further below) will also work with newer versions; that is, you don't have to rewrite all your older do files. Stata will just tell you that you have used old syntax. But the new syntax is much clearer.
Merge (Stata Version 11 or higher)
When merging datasets, you will try to match different information about the same cases, information that for some reason or other is stored in more than one data set (e.g., because one part of the information was collected earlier on and additional information has been obtained later). The challenge, of course, is to merge the datasets in such a way that information about individual A in dataset X is matched to the information about the same individual in dataset Y, and so on. The prerequisite for accomplishing this is to have in each of the datasets involved one or several variables that uniquely identifies or identify each case; such variables are called "key variable". In the following I assume the presence of only one key variable, called "caseid" in my examples.
If both data sets contain one row per case, you just write:
merge 1:1 caseid using name-of-second-dataset
Here, "name-of-second-dataset" (called the "using dataset" by the Stata people) is merged to the data in memory (called the "master dataset"), assuming that each value of variable "caseid" is present only once in each of the data sets. Not each caseid in one dataset has to have an equivalent in the other dataset, but unless there is a certain amount of overlap you will not be inclined to merge data.
merge m:1 caseid using name-of-second-dataset
assumes that for each value of caseid in the using dataset there may be more than one case with this caseid in the master dataset. The information from a case in the using dataset will be merged to each case in the master dataset with the same caseid. This is called a "many-to-one merge". A typical example would be data collected from (many) individuals living in different countries, stored in the master dataset. You could add information about those countries (say, GNP, percentage of expenses on public childrearing, share of female earnings) to the individual data, with values for each country in the using dataset where each country represents a single case. Here, "caseid" would refer to a country identifier whose values will be shared by several or many individuals, but which uniquely identifies each country in the using data set.
merge 1:m caseid using name-of-second-dataset
will perform a "one-to-many merge", assuming that a caseid may occur several times in the using dataset, but only once in the master dataset. This is the same constellation as before, the only difference being that the currently active dataset (the master dataset) contains information about the higher level units (e.g., countries) and the lower level information (e.g., individuals) is found in the "using dataset".
Note finally that
merge m:m caseid using name-of-second-dataset
will perform a "many-to-many merge", but there are few occasions in which this could be useful. Also,
merge 1:1 _n using name-of-second-dataset
will match two data sets "by order", that is, the first case of the first dataset is matched to the first case in the second one, the second case to the second, and so on. This may work in some cases, but as normally each dataset has an ID variable, it's rarely worth risking to create nonsensical data this way.
A final note, just to make sure: In the examples above, I used the variable name "caseid" as a key variable on which data are matched. Of course, in real data the name(s) of the ID-variable(s) will differ widely. Note that there is no way to change the name of the key variable "on the fly"; in other words: The key variable must have the same name in the datasets involved prior to matching.
Finally hat you can perform matches based on more than one matching variable.
Retaining only selected variables from the "using" file
If you wish to add only a subset of the variables that can be found in "using" file, you can tell Stata via option
keepusing, as in the following example
merge 1:1 caseid using name-of-second-dataset, keepusing (v17 v22 v25)
With each merge, a variable called
_merge is added to the dataset, indicating the status of each case. Possible values of
_merge in the standard case are:
- _merge = 1: A case that was present in the master dataset only.
- _merge = 2: A case that was present in the using dataset only.
- _merge = 3: A case that was present in both datasets.
Note that the variable
_merge has to be deleted before a new merge is performed; otherwise, an error message will result. However, there are ways to deal with this 'problem':
, generate mvarnamecauses stata to name the merging variable
, nogeneratewill suppress creation of the merging variable.
A variable with the same name may occur in both datasets. Stata's default way to deal with this problem is to keep the values of this variable from the master dataset and to delete the variable from the using dataset.
Again, there are ways to treat this cases differently with the help of options.
- With option
, updatemissing values in the master dataset will be replaced by values from the using dataset.
- With option
, replacevalues in the master dataset will be replaced if there is a valid value in the using dataset. This option must be used together with
Note that the variable
_merge can have some additional values if updates or replacements occur. See the Stata help for more information.
, sortedinforms Stata that the datasets are already sorted by the matching variable(s).
, keep matchwill cause Stata to keep only matching cases in the resulting dataset, in other words, cases with a value of 3 in variable
Some more options are available that cover rather special problems.
Merge (up to Stata Version 10)
Assume you have opened a data set with a number of variables. For the same individuals, a number of different variables is contained in another data set, named "cati_beko". Individuals in both data sets are identified by variable "caseid". Both data sets are to be put together alongside each other so that for each individual all variables from both data sets are present in a single row. In other words, the data sets have to be merged. The command
merge caseid using cati_beko, unique
will accomplish this.
"Using" may be followed by more than one data set. Note that when merge is use used more than one, you will have to
drop the variable
_merge that is created during the merging process.
Above, I have used option
unique which means that the identifying variable refers to a single case both in the data set in memory (also called master data set) and the data set referred to by "using" (called using data set). If this assumption is not true, you may use
uniqusing, which indicates that there unique observations in the master or the using data set only. However, multiple cases are a tricky issue from which I refrain for the moment.
Both data sets need not contain exactly the same cases. Option
nokeep causes Stata to drop observations in the using dataset that do not match. If one or both data sets are not sorted, use option
There are some other options that refer to technical details of labels, the "merge" variable created in the process of merging, and so on. Please refer to the Stata help for detail.
As far as I know (I never tried), you may match two data sets without any option. In this case, the first case of the master data set will be merged with the first case of the using data set, and so on.
append using dataset2
will append dataset2 to your working file (the master dataset). Actually, more than one dataset may be appended within one run of
append. I have found no indications in the handbook as to a limitation of the number of datasets to be appended.
generate(newvar) will create variable "newvar" with values as follows:
- newvar = 0: A case from the first dataset (the master dataset).
- newvar = 1: A case from the second dataset (the first using dataset).
- newvar = 2: A case from the third dataset (the second using dataset).
and so on.
Note that there are some further options, particularly for dealing with the problem of datasets whose structure is not really identical. For instance, the same variable may have a different data type or a different label in two (ore more) datasets. Please refer to the handbook for such cases.
© W. Ludwig-Mayerhofer, Stata Guide | Last update: 28 Sep 2020