Merging Files

Merging files can mean two different things. You may ADD two or more files that (usually) contain different cases, but have at least partly the same variables. ADDing files means that all cases previously in separate files will end up in one file; that is, the resulting file will have more cases, but very often not more variables. This case usually is rather simple; for instance, you may have the same sort of data for several school classes and wish to have them in a single file.

You may MATCH two files that contain the same cases (at least in part), but have different variables. MATCHing these files often means that you will deal with the same cases as before, but you will have more information (more variables) about them. This case may be very simple as well – see the first example on MATCHing below –, but sometimes necessitates (or allows) to deal with quite sophisticated problems.

Simple example for ADDing files:

ADD FILE
	/ file = 'c:\subdir\mydata1.sav'
	/ file = 'c:\subdir\mydata2.sav'
	/ file = 'c:\subdir\mydata3.sav'.

Simple example for MATCHing files:

MATCH FILE
	/ file = 'c:\subdir\mydata1.sav'
	/ file = 'c:\subdir\mydata2.sav'
	/ by id.

Complex example for MATCHing files:

MATCH FILE
	/ file = 'c:\subdir\mydata1.sav'
	/ rename (var17 var23a var300 = v17 v23a var301a )
	/ drop var12 var313 d1
	/ table = 'c:\subdir\mydata2.sav'
	/ by id.

Go immediately to MATCH FILES.

ADD FILES

As explained in the introduction, adding files is normally a very simple operation. More specifically, this is the case if the structure of the files you wish to add indeed are identical (more on this below). Then the simple example from the first section will do its job. Up to ten files can be ADDed in one step by contatenating them with backslashes; each file will be stacked below the others. If there are more than ten files, you just have to write more than one ADD FILE command.

What ADDing files means can be shown by a simple example. Let's suppose you have grades in maths for the boys and the girls in your class, but these are in different files. Thus you have two files with different cases, but identical variables. For instance, part of the boys' data will look like this (grades are in German notation, where 1 means A, and 5 means F):

Name Maths
John 2
Mike 3

Likewise, part of the girls' file may look like this:

Name Maths
Lisa 2
Anne 1

Let's suppose that the boys' data are in file 'mboys.sav' which is located in subdirectory 'c:\class98', and the girls' data are in file 'mgirls.sav' in the same directory. Then ADDing these two files with the command

ADD FILE
	/ file = 'c:\class98\mboys.sav'
	/ file = 'c:\class98\mgirls.sav'.
EXE.

will yield the new file

Name Maths
John 2
Mike 3
Lisa 2
Anne 1

Note that you have to SAVE this file if you wish to keep it as one file that you can work with later on.

Often, one of the files you wish to ADD to other files will be your working file. You can refer to your working file by the '*' sign. You may place it anywhere in your list of files; that is, the working file does not have to be the first file in your list, as in the following case, where the first file addressed is 'data1.sav', followed by the current working file and then 'data2.sav'�:

ADD FILE
	/ file = 'c:\subdir\data1.sav'
	/ file = *
	/ file = 'c:\subdir\data2.sav'.
EXE.

CAUTION: If you use this possibility or referring to your working file, the resulting file will have the same name as the initial working file. Saving this file without a change of name will overwrite the old file. This may be very annoying if you have used the "drop" or the "keep" subcommand (see below).

Some additional complications may arise, but there are several possibilities to deal with these. What these complications may consist in becomes clear when we think about what it means for two files to have an identical structure. I refer to two files as having an identical structure if they contain the "same" variables. This means that (1) all variables that are in one file are also present in the other file(s) and that no file contains variables that are not present in other files, and that (2) all the variables in all files that have the same name are of the same type. If the first condition is violated – i.e. if a file contains variables that are not present in other files –, nothing serious will happen; only the resulting data set may be bigger than you wish, with the result that speed of operations will slow down. The second case will be detrimental, that is, the operation of ADDing files will not be performed. I will deal with these two cases in turn.

Violation of the "same variables" condition

Let's assume you have two data sets you wish to ADD. Data set 'data1.sav' contains variables var1 and var2, but data set 'data2.sav' in addition has variable var3. What happens if you ADD the two files? The resulting data set will have variables var1, var2 and var3, and the cases from 'data1.sav' will have system missing values in var3. Of course this may be precisely what you want, but if you do not need var3, it may be good practice to drop it from the resulting data set, because each 'useless' variable makes itself felt negatively as far as computational speed is concerned. This can be achieved easily like this:

ADD FILE
	/ file = 'c:\subdir\data1.sav'
	/ file = 'c:\subdir\data2.sav'
	/ drop var3.

Sometimes you may have (or wish) to drop so many variables that it is simpler to tell SPSS which variables you wish to KEEP:

ADD FILE
	/ file = 'c:\subdir\data1.sav'
	/ file = 'c:\subdir\data2.sav'
	/ keep var1 to var3.

Note that of course you may always use the "drop" or "keep" commands to get rid of variables you do not need.

A problem that may occur (if you have not prepared your data well, that is) is that you have the same variables in both data sets, but these are named differently. Let's assume that 'data1.sav' has the same information that in 'data2.sav' is stored in variables var1, var2 and var3. But unfortunately, in 'data1.sav' these variables were named var5, var6 and var4. You can amend this by RENAMing variables (note the order of variables!):

ADD FILE
	/ file = 'c:\subdir\data1.sav'
	/ rename (var5 var6 var4 = var1 var2 var3)
	/ keep var1 to var3.

Of course, you may combine RENAMEing and DROPping (or KEEPing) variables. For instance, 'data1.sav' may contain the data that should be in var2, but they are called var5. Instead, what is called var2 contains data that you do not need anymore. Thus, you will write:

ADD FILE
	/ file = 'c:\subdir\data1.sav'
	/ rename (var5 var2 = var2 var4 )
	/ file = 'c:\subdir\data2.sav'
	/ drop var4.

Violation of the "same type" condition

If two variables have the same name, but are of different type, SPSS will not perform the ADD FILES operation. You cannot amend this during the ADD FILES run. You have to change the type of the variable prior to ADDing files. The interesting thing is which variables will be treated as 'of the same type' and which will not.

I deal here only with the most usual cases, that is numeric and string variables. What is obvious is that "numeric" is not the same type as "string". What is less obvious is that string variables of different width will also be treated as different type! If SPSS aborts an ADD FILE procedure with an error message, you may consider this possibility.

What happens with numeric variables that have different formats? Since the different formats apply only to the way variables are displayed, SPSS finds nothing wrong in ADDing these variables. It will use the format that is used in the first file on the ADD FILES list for display in the resulting working file.

MATCH FILES

Matching files may be very easy. For instance, you may have in one file the maths grades of your class, and in another file the sports grades. Now you may wish to join these two (or perhaps more) files. I give an example:

Here's the maths grades:

Name	Maths
John	2
Mike	1

Here's the sports grades (you see that you can be good both in maths and in sports - certainly in imaginary data):

Name	Sports
John	2
Mike	1

If (IF!) you have prepared your data well, MATCHing these files will be achieved quite easily.

(EXAMPLE TO BE USED WITH EXTREME CAUTION):

MATCH FILE
	/ file = 'c:\subdir\maths.sav'
	/ file = 'c:\subdir\sports.sav'.

Here's the resulting file:

Name	Maths	Sports
John	2	2
Mike	1	1

However, things will work as smoothly only in a simple case like this, when the cases in both data sets are the same and they are in the same order. In addition, different variables indeed must be named differently. (If they have the same names, SPSS will use the information from the first data set; this situation can be amended, however.)

SPSS can deal with cases when one or two of these conditions are not met, however. The opportunities offered to deal with these situations allow for accomplishing fairly complex data management tasks with a few operations. Therefore, you may have a look at the following even if right now you don't have any such problems. Especially the use of the keyword BY should become a rule.

Keyword BY

Matching BY a given variable ensures that the information of the same cases in the different files will indeed be matched correctly. Usually, each case in a data set has a unique identification variable (henceforth ID), and this identification variable will be used in any data set where this case occurs. If the cases in each data set are sorted in ascending order by the ID variable(s) (indeed, there may be more than one ID variable!!), the data sets may be matched as follows (with "name" used as ID variable):

MATCH FILE
	/ file = 'c:\subdir\maths.sav'
	/ file = 'c:\subdir\sports.sav'
	/ BY name.

This will result in a 'meaningful' data set even if some of the cases in the two (or more) different data sets are not identical. Let's assume we have the two following files:

Name	Maths
John	2
Mike	1
Andy	3

Name	Sports
John	2
Mike	1
Rick	5

Matching these two files BY name will yield the following file:

Name	Maths	Sports
John	2	2
Mike	1	1
Andy	3	.
Rick	.	5

If you are annoyed by having imcomplete cases in your file, this situation can be partly amended with the TABLE keyword which now will be explained. Note that this keyword has an additional use.

Note: As described above (ADD command), you may refer to a file you have worked with prior to MATCHing (your working file) with the '*' sign. If you use this possibility of referring to your working file, the resulting file will have the same name as the initial working file. Saving this file without a change of name will overwrite the old file. This may be very annoying if you have used the "drop" or the "keep" subcommand, which are also applicable (and often very usefully so) with the MATCH command.

Keyword TABLE

The first use of the TABLE keyword is to deal with the situation explained above. If one (or more) files that are matched are not evoked with keyword FILE but rather with keyword TABLE, only those cases of this or these file(s) will be matched to the file(s) that are evoked as FILEs that actually are contained in the FILEs. Let's assume we match the two files we used above as follows:

MATCH FILE
	/ file = 'c:\subdir\maths.sav'
	/ table = 'c:\subdir\sports.sav'
	/ BY name.

The result will be the following file (with Rick being dropped from the file because he has no equivalent in the maths file, but with Andy still present and with a Missing Value in variable "Sports"):

Name	Maths	Sports
John	2	2
Mike	1	1
Andy	3	.

There can be more than one TABLE files, and more than one file that is addressed as FILE (in this case, all cases that are in any of the files addressed as FILES will end up in the resulting data set). However, at least one FILE keyword is necessary; that is, the cases of this file will be the minimum cases that will be contained in the resulting file.

The second use of keyword TABLE is as follows: If the file(s) that is (are) matched with keyword FILE contain more than one case with the same ID variable, and the file(s) with keyword TABLE has or have only one case for each ID, the variables of the TABLE file(s) will be matched to all cases in the FILE file. This is often useful if there is some aggregated information that is to be matched to other data. For instance, I have analyzed unemployment data on individuals, and I was wondering what effect the unemployment rate in the respective Bundesland ("State") has on the individual length of unemployement. Of course, the data set on individuals has many persons from each Bundesland. Thus, we may have the following file on individuals' duration of unemployment (extract):

ID	Land	Duration
1	1	5
2	1	10
3	1	2
4	2	6
5	2	11

The data on regional unemployment rates are in the following file

Land	Unem_rat
1	10.1
2	5.8

Since you wish to have the information of the unemployment rate on each individual in each Bundesland, you have to match the files by the variable "Land" (be sure that both files are sorted by this variable!). The command:

MATCH FILE
	/ file = 'c:\subdir\individ.sav'
	/ table = 'c:\subdir\unemprat.sav'
	/ BY land.

will yield the following file:

ID	Land	Duration	Unem_rat
1	1	5	10.1
2	1	10	10.1
3	1	2	10.1
4	2	6	5.8
5	2	11	5.8

Same variables in data sets to be matched

If the data sets to be matched contain the same variables - that is, variables with the same names - (other than any variables BY which the data sets are to be matched), SPSS will use the first occurrence of each variable. For instance, if var2 is in 'data1.sav' and also in 'data2.sav', and 'data1.sav' is mentioned first on the MATCH FILE statement, var2 from 'data1.sav' will be in the new data set and the information from var2 in 'data2.sav' will be lost. Often, this will be precisely what you want. In other cases (perhaps if the two variables, albeit having the same name, have different content) you have to RENAME the variables. The "more complex example" at the beginning of the page explains how to do this. (A RENAME keyword can follow any FILE or TABLE line and will refer only to that file.)

Variables in data sets to be matched that are not needed in the resulting file

Often, one (or several) data set(s) that are to be matched contain variables that are not needed for further analysis. These variables may be dropped from the resulting data set with keyword DROP (indeed). The "more complex example" at the beginning of this page has a line with an example for that case. (A DROP keyword can occur after any FILE or TABLE line and will refer to the respective file only.) The situation where there are many variables to be DROPped is one of the few instances where I use the SPSS DIALOG BOX, because you can easily click variables from one window ("new working data file") to the other ("exluded variables").