Reading Raw Data (ASCII Format)

Frequently, data are available not as an SPSS system file but as so-called "raw data", i.e. a file that consists only of numbers (or perhaps other characters). Such a file is usually called an ASCII file, because it contains (or is supposed to contain) only numbers and characters that conform to an international standard (the ASCII standard) and nothing else. Other names sometimes encountered for such files are DOS file or text file.

Such data can come in two formats. In the first, the free format, there is a value (or some other character) for each variable and each case, and the columns are separated, usually by a blank character. Here, the data values need not be stacked exactly below each other, as in the following example:

1 15 300 27
2 1 7 9
3 2 4000 1

In the second format, the fixed format, every variable occupies exactly the same columns in every line, i.e. all values are stacked exactly below each other (with numbers right-adjusted):

1 15  300 27
2  1    7  9
3  2 4000  1

The advantage of this format is that, given the appropriate precautions, you do not need separators between the different columns, as in the following example with the same data re-arranged:

115 30027
2 1   7 9
3 24000 1

Such data can easily be read by SPSS and transformed into a system file. The free format requires less effort; however, you have to be very sure about how missing values have been dealt with. With fixed format, you have to put more effort into describing the layout to SPSS, but missing values are less of a problem.

Example for free format data:

DATA LIST file = 'c:\subdir\mydata.asc' free
	/ var1 var2 var3 var4.
EXE.

This example tells SPSS to read a data file in which each case has four variables named var1, var2, var3, and var4, respectively. (The command exe, an abbreviation of execute, is necessary for SPSS to start reading in the data. Another possibility is to add a command that requires SPSS to read the data, such as a frequencies command.) Note that you can use any variable name that conforms to the SPSS conventions, i.e. the name should not be longer than 8 characters, should start with a letter, and should not end with an underscore.
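
As a minimal sketch of the second possibility, the following variant replaces exe with a frequencies command, so that the data are read when the frequency table is requested. The choice of var1 as the variable to tabulate is arbitrary and only serves as an illustration.

```spss
DATA LIST file = 'c:\subdir\mydata.asc' free
	/ var1 var2 var3 var4.
FREQUENCIES VARIABLES = var1.
```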

Example for fixed format data:

DATA LIST file = 'c:\subdir\mydata.asc' fixed
	/ var1 1 var2 2-3 var3 4-7 var4 8-9.
EXE.

The numbers following each variable name indicate the exact location (column numbers) of the variable. The example given corresponds to the third small data set above, that is, the data set with no separators between the columns.

Note that you can also read data "on the stream", that is, by simply including them in your syntax file. In this case, you do not need a reference to a data file, as in the example above; however, you have to inform SPSS where your data start and where they end. This is achieved in the following way:

DATA LIST fixed
	/ var1 1 var2 2-3 var3 4-7 var4 8-9.
begin data.
115 30027
2 1   7 9
3 24000 1
end data.

Of course you can also read free format data "on the stream".
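
A minimal sketch of this, using the first small data set from above. The LIST command at the end is an addition for illustration: it prints the cases and, being a procedure, also makes SPSS read the data, so that no separate exe is needed.

```spss
DATA LIST free
	/ var1 var2 var3 var4.
begin data.
1 15 300 27
2 1 7 9
3 2 4000 1
end data.
LIST.
```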

Note that overall, you can read quite a lot of different types of data, such as calendar dates in several formats, weekdays, data in scientific notation or in hexadecimal code. The examples that follow refer to the two most common types of data, namely, numeric (numbers only, including decimal values) and alphanumeric (characters) data.

Free format data files

The advantage of free format files is that you simply have to give SPSS a list of the variable names and do not have to care about the exact location of your variables. However, you have to be absolutely sure that you have values (or characters) for all of your data. If you have some missing data which are simply represented by blanks (i.e. there is simply "nothing"), this will usually cause tremendous problems. Consider the following example:

1 15  300
2  1    7  9
3  2 4000  1

Here, the first line contains only 3 values. SPSS will therefore read the first value of the second line as the fourth value of the first case in your data set. In this example, you will notice that something went wrong, as SPSS will terminate with the message "An unexpected end of file has been encountered in the middle of a case". This simply means that SPSS could not find the appropriate number of data values (i.e. four) for what it assumed to be the last case in your data, and so noticed that something was wrong. However, if the overall number of missing values happens to be an exact multiple of the number of columns in your data, SPSS will not notice that anything is wrong.
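
The silent failure can be reproduced with a small data set read "on the stream". In the sketch below, each of the four lines lacks its fourth value, so twelve values are present where sixteen were intended. The data values are invented for this illustration.

```spss
DATA LIST free
	/ var1 var2 var3 var4.
begin data.
1 15 300
2 1 7
3 2 4000
4 6 50
end data.
LIST.
```

Since twelve is a multiple of four, SPSS should build three complete cases without complaint: (1, 15, 300, 2), (1, 7, 3, 2) and (4000, 4, 6, 50). None of them corresponds to a line of the intended data.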

It is therefore advisable to make sure that each missing value is represented by an explicit character, preferably a number. Missing values may also be represented by other characters, for instance by a dot (which is used by SPSS for system missing values). SPSS will in this case read the dot, but it will issue a warning. If you have many dots, this will cause many warnings (you may request SPSS to suppress these warnings, but this may not be advisable, as you may miss something really important ...).
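
One way to follow this advice is sketched below, assuming that -9 never occurs as a real value in the data. The missing fourth value of the first case is coded as -9, and the MISSING VALUES command then declares this code as a user-defined missing value. The code -9 and the choice of var4 are assumptions made for this illustration.

```spss
DATA LIST free
	/ var1 var2 var3 var4.
begin data.
1 15 300 -9
2 1 7 9
3 2 4000 1
end data.
MISSING VALUES var4 (-9).
LIST.
```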

Note that you can also read characters other than numbers, for instance variables that consist of letters. Such variables are called "alphanumeric" variables. In this case, things get more complicated, as you have to tell SPSS which variables are numeric and which are alphanumeric. Moreover, if your alphanumeric variables have more than one character, you have to tell SPSS this as well, and in that case you also have to tell SPSS if your numeric variables have more than one column (SPSS will read all the columns, but it will assume that only one column is to be shown in your data spreadsheet).

Here are a few examples to clarify the syntax for these more complicated cases:

DATA LIST free
	/ var1 (n) var2 (a) var3 (n).

Here we have three variables, the first one being numeric, the second one alphanumeric, the third one again numeric. All variables are assumed to consist of one column only. (If the alphanumeric variable has more columns, SPSS will terminate with an error message; if the numeric variables have more columns, SPSS will read all of them but will display only one column unless you change the format of the variable.)

DATA LIST free
	/ var1 (n) var2 (a2) var3 (n4.1).

Here we have three variables, the first one being numeric with one column only, the second one alphanumeric with two columns, the third one again numeric. This last variable consists of four columns, the last of which holds a decimal value. Note that if you have decimal values, the column before the decimals is reserved for the decimal separator, i.e. a dot or a comma, according to your national preferences. That is, the instruction (n4.1) assumes that there are two columns before the decimal separator, and one decimal value.
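
The following sketch reads data of this kind "on the stream". It is an illustration under assumptions: the data values are invented, the decimal separator is taken to be a dot, and the numeric variables are declared with F, the general numeric format of SPSS, instead of n.

```spss
DATA LIST free
	/ var1 (f1) var2 (a2) var3 (f4.1).
begin data.
1 ab 12.5
2 cd 30.2
3 ef 7.0
end data.
LIST.
```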

Fixed format data files

Fixed format files are somewhat more difficult to handle, as you have to specify the exact position of each variable. However, in this case, missing values represented by blanks will not be a problem.
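
A minimal sketch of this point, using the small data set from the free format section with the fourth value of the first case left blank. By default, SPSS treats a blank numeric field in fixed format as system-missing, so the remaining values stay in their proper columns. The LIST command is added only to display the result.

```spss
DATA LIST fixed
	/ var1 1 var2 3-4 var3 6-9 var4 11-12.
begin data.
1 15  300
2  1    7  9
3  2 4000  1
end data.
LIST.
```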

Even though fixed format data require you to indicate the exact location of each variable, this is not so much of a problem if the data set does not contain many variables. But even in this rather simple case, there is a small problem if you have decimal values. I will elaborate on this in the following examples.

Reading the following data is very simple:

1 15  300 27
2  1    7  9
3  2 4000  1

It just requires the following command:

DATA LIST file = 'c:\subdir\mydata.asc' fixed
	/ var1 1 var2 3-4 var3 6-9 var4 11-12.

with the numbers after the variable names indicating the exact column position of the variables in the data set. You can now see why you would not need blanks between the variables: SPSS locates each value by its column position, not by a separator.

One problem with this arrangement is that SPSS, with these commands, assumes that the variables do not have decimal values. If there are decimal values, SPSS will read them but will not display them in the data window. One solution is to change the format in which the data are displayed, using the format command. For instance, if var3 in the above example consisted of data with two columns before the decimal dot, the dot itself, and one decimal value, you could indicate this after having read the data with the following command:

format var3 (f4.1).

The part in parentheses, by beginning with an f, tells SPSS that this is a numeric variable. The next number indicates the overall number of columns occupied by this variable, and the number after the dot indicates how many decimal values to display. An alphanumeric variable with three columns would be indicated by (a3).
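
Put together, the whole sequence might look as follows. This is a sketch with invented data values in which var3 occupies columns 6 to 9 and contains a decimal dot; the data are read with plain column positions first, and the display format of var3 is changed afterwards.

```spss
DATA LIST fixed
	/ var1 1 var2 3-4 var3 6-9 var4 11-12.
begin data.
1 15 30.5 27
2  1  7.0  9
3  2 40.2  1
end data.
format var3 (f4.1).
LIST.
```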

A different possibility is to indicate the format when reading the data. However, if the columns are separated by blanks, you have to be careful. In the above example, again assuming that var3 has four columns, one of which is the decimal dot and one the decimal value, you must also account for the empty column between var2 and var3. It would therefore be wise to read the data in the following way:

DATA LIST file = 'c:\subdir\mydata.asc' fixed
	/ var1 1 var2 3-4 var3 (f5.1) var4 11-12.

As you can see, both formats -- indicating the exact column and indicating the format of a variable -- can be mixed. But be sure to remember that if you use an indication of the format, SPSS will assume that the variable will start immediately after the last column of the previous variable.
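
The mixed specification can be tried out with a few invented lines of data read "on the stream". In this sketch, var2 ends in column 4, so (f5.1) makes var3 cover columns 5 to 9, that is, the empty column 5 plus the four columns that hold the value. The data values are assumptions made for illustration only.

```spss
DATA LIST fixed
	/ var1 1 var2 3-4 var3 (f5.1) var4 11-12.
begin data.
1 15 30.5 27
2  1  7.0  9
3  2 40.2  1
end data.
LIST.
```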