Cox Regression

Cox regression permits a multivariate comparison of hazard rates. However, this procedure does not estimate a "baseline rate"; it only provides information about whether this 'unknown' rate is influenced positively or negatively by the independent variable(s) (or covariates). The procedure uses a partial likelihood estimation method ("partial" because not all the information in the data is used). One of its advantages is that it can incorporate time-dependent covariates, that is, variables that change their value during the observation period.
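The underlying model may be written as follows (a standard textbook formulation, not specific to SPSS):

  h(t | x1, ..., xk) = h0(t) * exp(b1*x1 + ... + bk*xk)

Here h0(t) is the baseline hazard, which remains unestimated, and the coefficients b1, ..., bk describe whether each covariate raises (positive b) or lowers (negative b) this baseline rate.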

Example for a model without time-dependent covariates (the duration variable, here called dauer, stands in for whatever your data set uses):

  COXREG dauer
  / STATUS = ziel (1)
  / METHOD=ENTER var_x var_y
  / CONTRAST (var_y)=Indicator (2)
  / PLOT = SURVIVAL HAZARD LML
  / PATTERN BY var_y
  / SAVE = PRESID DFBETA .

After the keyword COXREG, you first have to provide the duration variable and, after the keyword STATUS, the status variable together with the value(s) that indicate(s) whether or not an observation terminated with an event. The value in parentheses indicates events; all other values are treated as censored observations. Several values may be specified within the parentheses; however, all of these will be treated as indicating one single event (and not "competing risks" in the language of event history analysis). After METHOD=ENTER, the covariates to be included are listed.

In the next line, SPSS is told that the variable var_y is to be treated as a categorical variable. The keyword INDICATOR in this line means that var_y is decomposed into a series of k-1 dummy variables (k being the number of categories of var_y), with the second category as the reference category. (Note that this is one of the many options not offered when working with SPSS's menu system, where you can only choose the first or the last category.) If there were one or several other categorical variables, you would include a corresponding number of CONTRAST lines. The advantage of telling SPSS that there are categorical variables, and how to treat them, consists not only in the automatic creation of dummy (or other) variables; what is more important, SPSS will test the overall influence of the set of related (dummy or other) variables on the likelihood function.
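For instance, with a second categorical variable (here hypothetically called var_z; the duration variable dauer is likewise made up for the illustration), each categorical variable would get its own CONTRAST line:

  COXREG dauer
  / STATUS = ziel (1)
  / METHOD=ENTER var_x var_y var_z
  / CONTRAST (var_y)=Indicator (2)
  / CONTRAST (var_z)=Indicator (1)

Here, INDICATOR (1) makes the first category of var_z the reference category.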

Note that SPSS provides an easy way of including interaction effects: Let's assume that you want to test, in addition to the main effects of var_x and var_y, the influence of the interaction of both variables. All you have to do is add var_x by var_y to the METHOD=ENTER line.
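With the interaction term included, the METHOD line of the example above would read:

  / METHOD=ENTER var_x var_y var_x by var_y

SPSS will then estimate the two main effects plus the interaction effect.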

The PLOT line requests displays of the survivor function and the cumulative hazard function. Even though these functions are not estimated directly, approximation procedures have been developed. SPSS will first display these functions at the mean vector of all covariates. The next line (PATTERN) in addition requests these plots for the different categories of var_y (at the mean of var_x). Note that these graphs are based on the estimates of the model. Thus, the LML plot (Log-minus-log plot) will look as if the effects of the different levels of var_y were proportional to each other, because this assumption is built into the model. The next example will show how to display this plot in such a way that this assumption can be tested.

Finally, partial residuals and dfbeta statistics are added to the data set. These statistics can help you to recognize cases that are not well accounted for by the model or that exert undue influence on the model. A useful way of dealing with these statistics is to plot them and to look at cases that stand out conspicuously from the remainder of the cases.
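One simple way of inspecting the saved statistics is a scatterplot against the case number. As a sketch (the variable name dfb1_1 is only an example of the names SPSS assigns to saved dfbeta statistics; check the names actually created in your data window):

  COMPUTE caseno = $casenum.
  GRAPH
  / SCATTERPLOT (BIVAR) = caseno WITH dfb1_1 .

Cases with conspicuously large values deserve a closer look.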

Example for a model with stratification:

  COXREG dauer
  / STATUS = ziel (1)
  / METHOD = ENTER var_x
  / STRATA = var_y
  / PLOT = LML

With this command, a "stratified" model will be estimated. This means that var_y (the stratification variable) is not a covariate whose influence is assessed; rather, a model will be estimated that allows for different baseline hazards for the different values of var_y. The graphs derived for the different values of var_y will now display the "real" values of the respective functions (as opposed to those estimated from the model). Thus, inspection of the LML plot will permit a rough judgement about whether the hazard rates for the different values of var_y are proportional to each other or not.

Example for a model with time-dependent covariates:

  TIME PROGRAM.
  COMPUTE kleink =
    (altk1 + t_ gt -6 and altk1 + t_ le 24) or
    (altk2 + t_ gt -6 and altk2 + t_ le 24) or
    (altk3 + t_ gt -6 and altk3 + t_ le 24) or
    (altk4 + t_ gt -6 and altk4 + t_ le 24).
  COXREG dauer
  / STATUS = ziel (1)
  / METHOD = ENTER kleink var_x var_y

Incorporating time-dependent covariates is one of the most exciting features of event history analysis, and it is important to understand why. Imagine you wish to test whether married people live longer than those who never married or whose marriage broke up. It would be dangerous to include, for instance, just a fixed variable indicating whether people ever married and, if so, whether they got divorced. Perhaps this way you may find that those who got divorced live longest. While this may well be desirable, this result may be an artifact, because those who live longer have a greater chance of ever getting divorced. Therefore, marital status must be treated in a way that reflects the "marriage process" of the individuals under study.
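As a sketch of how such a "process" variable might be defined (all variable names are made up: marmon and divmon are assumed to hold the process month of marriage and divorce, with a very large value such as 9999 when the event never occurred):

  TIME PROGRAM.
  COMPUTE married = (t_ ge marmon) and (t_ lt divmon).

At each point in process time, married is 1 only between marriage and (possible) divorce, and 0 otherwise.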

To include time-dependent covariates, these have to be defined prior to your COXREG run in an extra section introduced by the command TIME PROGRAM. Here, I have chosen an example where we wish to know whether, at any time during the process investigated, the individuals in our data set have children aged between -6 months (meaning that the individuals or their partners were pregnant and the baby was to be born 6 months later) and 24 months. Up to 4 kids are taken into account (meaning that 4 kids are 'tested' as to whether they were in that age group during the process under investigation).

How can we relate these data to the process? First, we need variables that tell us the age of the children at the beginning of the process. This may well be a negative age, e.g., -60 months, if a baby was born only 60 months after the process we are investigating (e.g., the individuals' occupational career) started. These variables are here called altk1, altk2, altk3 and altk4. Then, at each point in time, that is, at months 1, 2, 3, and so on, of the process, the kids' age has to be "updated" and compared to the age frame we have chosen. To achieve this, we use a variable called "t_" that is provided by SPSS and represents process time. Thus, in our example, at each point in time, the current process time is added to the kids' age at the beginning, and if the result lies between -6 and 24, a dummy variable (called "kleink" in this example) takes the value 1. At all other points in time, this variable has the value 0. (Note that instead we might compute the number of kids aged between -6 and 24 months; the time program need not be restricted to a single COMPUTE statement, but may be fairly complex indeed. Also, up to 10 different time-dependent covariates can be created. For instance, since children - or rather social norms regarding children - exert a different influence on women and men, in our example we might wish to create an interaction effect between the individuals' sex and the variable relating to the kids.)
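The two variants just mentioned might look like this (sex is assumed here to be a 0/1 dummy variable; nkleink and kleinsex are illustrative names; a logical expression in a COMPUTE statement yields 1 or 0, so the four terms can simply be summed):

  TIME PROGRAM.
  COMPUTE nkleink =
    (altk1 + t_ gt -6 and altk1 + t_ le 24) +
    (altk2 + t_ gt -6 and altk2 + t_ le 24) +
    (altk3 + t_ gt -6 and altk3 + t_ le 24) +
    (altk4 + t_ gt -6 and altk4 + t_ le 24).
  COMPUTE kleinsex = nkleink * sex.

Both nkleink and kleinsex would then be listed in the METHOD=ENTER line.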

The remainder of the COXREG command works just as in the case without time-dependent covariates, with the following exceptions: the plots of the survival and the cumulative hazard functions are not available, and SPSS cannot compute partial residuals.

© W. Ludwig-Mayerhofer, IGSW | Last update: 18 Jul 2005