COMPUTE

This command does what is says: It computes. Put more formally: It performes mathematical operations on variables and numbers. The result is a (usually new) variable that is added to the data set. Fairly complex operations may be performed.


Introduction and a few examples

The general syntax of the COMPUTE command is:

COMPUTE newvar = expression.

That is, on the left hand of the equation is the name of a variable you wish to create, and on the right hand there is an expression how this variable is to be created. The expression may contain operators, variable names and numbers. Note that the variable you wish to create may have the name of an existing variable; in that case the values of the existing variable will be replaced by those you have COMPUTEd.
Again, the best way is learning by example.

COMPUTE houshinc = inc1 + inc2 + inc3 + inc4 + inc5.

This is the most simple type: Some operations are performed on variables. Here, we have the income of up to 5 household members in variables inc1 to inc5. The household income is computed by adding these 5 incomes. NOTE that if there are values that are defined as MISSING VALUES in the variables on the right hand, no household income will be computed.

Instead of variables, you may also use constants, i.e. numbers. For instance, if you have a variable, say, var17, ranging from -2 to +2 and you want it to range from 1 to 5, you can use the following command:

COMPUTE var17 = var17 + 3.

(Note that I have used the same variable on the left hand and on the right hand of the '=' sign. That is, this command will augment all values of var17 by 3. In this case, you can easily restore the initial variable. However, this may not always be possible, therefore this possibility should used with some caution.)

In addition to the usual arithmetic operators, many other operators are available. These will be explained in more depth below. Here is just an example:

COMPUTE var18 = RND(var17).

rnd stands for rounding. If var17 has decimal values, these will be rounded to the next integer.

The operations can be combined in fairly complex ways. Be sure to use parentheses to clarify the order of operations. (I often use parentheses simply for sake of clarity.) Here's an example:

COMPUTE newvar = ((var1 + 5) * var17) / (ABS(var318) * SQRT(var29)).

Here, 5 is added to var1 and the sum is multiplied by var17. The result then is divided by the product of the absolute value of var318 and the square root of var29. Note that if var29 contains values for which the square root is not defined, the new variable will have a system missing value.

The remainder of this page is organized as follows: I give tabular overviews of many – but by no means all! – important functions that are available. In some cases where the use of the functions will not be immediately clear to the novice, I will provide examples. On the bottom, I will say a few words about logical and relational operators; even though these normally are used with conditional transformations (IF . . .), there is a special case where these will be applied with the compute command.


Mathematical operators

+ Addition
- Subtraction
* Multiplication
/ Division
** Exponentiation

What to do with these should be fairly obvious.


Arithmetic functions

ABS(expr) Absolute value (-3.7 will become 3.7, for instance)
RND(expr) Rounding
TRUNC(expr) Truncation (decimal values are simply cut off)
MOD(expr, divid) Remainder of the first argument (expr) when divided by the second ("divid", which of course is a number).
SQRT(expr) Square root of expression.
LN(expr) Natural log of expression.

Note that "expression" does not have to be a variable; it may also consist of variables or numbers concatenated by operators and functions, such as in

COMPUTE v3 = ABS(v1 - v2).

Example: What is the use of MOD (for "modulus")? In some applications, the month of an event (say, marriage, or first job, or unemployment) is coded in "months since a certain date", e.g., since 1 / 1 / 1900. Thus, if the event was in January 1900, the month will be coded as 1, if it was in December 1900, the code will be 12, January 1902 will be 25, or March 1998 will be coded as 1179. Now, for instance, you may wish to know whether the risk to get unemployed is higher in certain months, for instance in late autumn and early winter (a reasonable assumption at least in some occupations). This information can easily be extracted by the following operation (assuming that the variable "start of unemployment" is called startun):

COMPUTE monthun = MOD(startun,12).

Variable monthun will now contain values 1 if unemployment started in January, 2 if it started in February, and so on; December will be coded as 0 (zero). Now you can process this variable further to yield the specific month(s) you may be interested in.


Statistical functions

Statistical functions operate on several variables or functions.

SUM(expr1, expr2...) Sum of the expressions enclosed in the parentheses
MEAN(expr1, expr2...) Mean of the expressions enclosed in the parentheses
SD(expr1, expr2...) Standard deviation of the enclosed in the parentheses
VARIANCE(expr1, expr2...) Variance of the expressions enclosed in the parentheses
CFVAR(expr1, expr2...) Coefficient of variance of the expressions enclosed in the parentheses
MIN(expr1, expr2...) Minimum value of the expressions enclosed in the parentheses
MAX(expr1, expr2...) Maximum value of the expressions enclosed in the parentheses

Note that "expression" does not have to be a variable; it may also be an expression of variables or numbers concatenated by operators and functions, such as in

COMPUTE var4 = SUM(ABS(var1 * var2), var3, var5).

This will compute the absolute value of the product of var1 and var2. This value then will be added to the values of variables var3 and var5. Here as in most other cases, on the left hand of the equation there may be the name of a variable that is also used on the right hand. Usually, this will be dangerous, but perhaps it is just what you want!

Example: The MEAN (. . .) function is useful in computing indices if all items that enter into the index are on the same scale and are supposed to be weighted equally (an assumption that is made as regularly as it is usually not warranted).

COMPUTE newvar = MEAN(var17 to var19, var25).

The mean of the variables var17, var19, all variables in the data set ("physically") between var17 and var19 (which won't necessarily be var18!), and var25 will be stored in newvar. The advantage of this procedure is that the mean will be computed even if there are missing values in some of the variables (these variables will be simply disregarded). Its disadvantage is that it will be computed even if there are missing values in some of the variables. What I want to say is that often it is meaningful to compute an index even if there are one or two items missing, but usually it is not meaningful if too many items are missing.

Older versions of SPSS (e.g., version 6.0) do not care for that problem, it's you who has to care. Therefore, it will be good practice to count the number of missing values and do conditional transformations (see next section) on those cases that do not have too many missing values. E.g., if you have counted the number of missing values for variables var17, var18, var19 and var20 in variable indmiss, you might compute the index like that:

IF (indmiss LE 1) myindex = MEAN(var17, var18, var19, var20).

Newer versions (that is, 6.1 and later) offer a remedy, however. You may indicate the minimum number of variables with valid values you will accept, as in the following example (where cases with less than two variables with valid values will be omitted from the computation):

COMPUTE varnew = MEAN.2 (var17, var18, var19, var20).

Warning: The problem here described is even more pertinent if you want to compute indices by adding several variables with the help of the SUM(  ) operator. You might get small values for some cases simply because they have missing values. Therefore, great caution is needed in the application of this and related procedures if there are missing values in your data (which normally is the case – if you have a huge data set from a survey without missing values, you can be more or less hundred percent sure that something is wrong.)


Random number functions

Random numbers based on a great variety of distributions can be computed. I mention only the two most important.

RV.UNIFORM(min,max). Creates, for each case in the data set, a random variable which will be uniformly distributed between values "min" and "max" (note that you have to enter numbers in the place of "min" and "max").
RV.NORMAL(mean,sddev). Creates, for each case in the data set, a random variable which will be normally distributed with mean "mean" and standard deviation "sddev" (note that you have to enter numbers in the place of "mean" and "sddev").

These and the many other possibilities usually are of interest for the advanced user. However, in case you want to create a "jittered scatter plot", the RV.UNIFORM( ) function will be very helpful. Assume that you have two variables of the usual Likert item type, say, var39 and var45, with values ranging from 1 to 5 (meaning, e.g., "completely agree" to "completely disagree"). To create a jittered scatter plot, you may wish to add a small random number to each data value, as for instance:

COMPUTE var39j = var39 + RV.UNIFORM(-0.3, +0.3).
COMPUTE var45j = var45 + RV.UNIFORM(-0.3, +0.3).


Dates

Since a couple of years, SPSS can deal with variables in a specific "date" format. However, sometimes it is more convenient to have dates in separate variables coded as year, month and day of the respective date.

If the data are in "date" format, you can compute the difference between two dates in seconds as follows:

COMPUTE tdiff = laterdate - earlierdate.

To obtain the difference in days, this must be divided by seconds per minute, minutes per hour and hours per days, that is:

COMPUTE tdiffdays = (laterdate - earlierdate) / 60 / 60 / 24.

If dates are stored as three separate variables "year", "month" and "day", difference between two dates can be computed as follows (provided the original variables are of the numeric type):

COMPUTE tdiff = YRMODA(year2, month2, day2) - YRMODA(year1, month1, day1).

The resulting variable will give the number of days between the two dates.

Some users may feel more at ease if they have separate variables for year, month and day instead of a single variable referring to a date. Therefore, you can extract parts of a date in separate variables:

COMPUTE byear = xdate.year(bdate).
COMPUTE bmonth = xdate.month(bdate).
COMPUTE bday = xdate.mday(bdate).

There are some additional possibilities for extraction such as the day in the week and other stuff.


A special operator: LAG

Note: As of version 17.0, there is a special command that seems to deal with the problem for which the LAG operator was created, i.e. the command SHIFT VALUES.

The LAG operator refers to the row "above" each row. Thus, if we have two lines of data like this:

ID INCOME YEAR
    45000  1999
    43400  2000

we may compute

COMPUTE prvincom = LAG(income).

which will result in

ID INCOME YEAR PRVINCOM
    45000 1999
    43400 2000  45000

I hope this example makes clear on which occasions the LAG operator may be useful. In this example, we have several rows of data which refer to one and the same person; and we wish to compute for each person the income of the previous year. (Note that we have to take precautions to which I will return immediately.) Another example might be data from a household survey where each household member has contributed data. Thus, each row in the data set may represent one person, but in addition there may be a variable indicating the household people come from. Occasionally, you may wish to use other household members' attributes as variables in your individual level analysis; again, the LAG operator may be very useful. Other examples easily will come to your mind.

Of course, the initial example was too simple. A data set will have data from several individuals, and we have to be sure that the income of the last row of one person is not considered as the "previous" income in the first row of the next person. Thus, a better way to achieve our aim is to use the IF clause which will be explained in the next chapter on conditional transformations:

IF (id EQ LAG(id)) prvincom = LAG(income).

As you can see, you may refer to LAG(id) implicitly; you don't have to compute explicitly a variable like "prvid".

It should be obvious that for the very first row in the data set, any expression containing LAG is not defined. In our example, this is no problem, as for each person (and thus also for the first person) there is no previous income for the first row. But in other circumstances, problems may arise. For instance, you may wish to define the first row of each person in the data set by comparing it to the previous id (if current id is not equal to LAG(id), we have the first row of an individual). The very first individual, however, cannot be compared to LAG(id). Here we may exploit the fact that successive commands build on each other:

COMPUTE first = 1.
IF (id EQ LAG(id)) first = 0.

Even though the first command line makes SPSS compute the value 1 for variable "first" for each row of the data set, the second line will correct this and produce zeros in all rows but the first of each person. At the same time, we got round the 'very first row problem'.

TIP: Sometimes you may wish to use the 'reverse' of the LAG operator. In fact, such an operator exists. SPSS has hidden it in a special section about time series data. Suppose you want not to define the first but rather the last data row for a person (or whatever your research unit is). The command to compare a variable (I will again use the variable 'id' supposed to represent the identification number of the unit) to the value of the variable in the next row is as follows:

CREATE nextid = LEAD(id,1).

This means that the variable nextid will contain the value of the id of the very first row to follow (that's what the number 1 after id says). That is, obviously you might also create new variables referring to the second next row and so on (perhaps there is an upper limit; I did not investigate this in more detail). There is also a drawback of the CREATE command, however: An 'implicit' use of the LEAD operator (or other operator that can be used with CREATE) as in the earlier example is not possible. Also, with very large data sets you may encounter problems concerning the available workspace (which may be circumvented by enlarging your workspace with the command SET WORKSPACE=XXXXXX where xxxxxx has to be replaced by an appropriate number like 100000 or even more).


Logical and relational operators in COMPUTE commands

Logical operators are expressions like "and", "or", "not". Relational operators are expressions like "equal to", "greater than". These will be explained in more detail in the chapter on conditional transformations. However, they have to be mentioned here because of a specific feature of the COMPUTE command. This feature is the creation of a binary variable that has the value of 1 if the expression on the right hand is true and 0 if it is false. This is very useful for creating dummy variables.

Assume that we have a variable sector, indicating whether a person works in the primary (agrarian), secondary (industrial), or tertiary (service) sector, coded as 1, 2 and 3, respectively. If we wish to create two dummy variables for persons employed in the secondary and the tertiary sectors, this can be written as follows:

COMPUTE secsect = sector EQ 2.
COMPUTE tertsect = sector EQ 3.

That is, all cases that have value 2 in the variable "sector" will have a 1 in variable "secsect", and all others will have a 0. The analogy to the variable "tertsect" is obvious.

Again, more complex examples are easily available. Remember that above I have explained how the modulus function may be used to compute the month in which an event happened. The example rendered a value of 11 for November and of 0 for December. Assume we are interested only whether or not unemployment occurred in November or December, but not the specific month. Thus, we might compute a dummy variable like this:

COMPUTE unovdec = MOD(startun,12) EQ 11 OR MOD(startun,12) EQ 0.

© W. Ludwig-Mayerhofer, IGSW | Last update: 22 Jul 2005