Wolfgang Ludwig-Mayerhofer's Introduction to TDA

Analysis without time-dependent covariates

The first point I have raised refers to the zero duration in the data, which is found in case # 88. Using the data as they stand, you will find that TDA terminates with an error message. Although avoiding zero durations is not strictly necessary when estimating nonparametric or semiparametric models (in which only the order of durations, not the exact time elapsed, matters), we may wish to estimate parametric models later, in which zero values have to be avoided in any event. Thus, we shall discuss this matter briefly.

Clearly, durations of zero cannot arise in "real life". To remedy the problem, we may do two different things: change the single value that causes trouble by adding a small constant (for instance, 0.01, as this will barely affect the estimates even in the case of continuous-time models), or add a constant to all durations. The latter may be justified by the following line of reasoning: obviously, in the data we are dealing with, the unit of time measurement was a month. Therefore, a duration of zero may have occurred if a student entered and left college in the same month. More generally, durations may have been obtained by subtracting the month of entering college from the month of finishing or dropping out of college. (Of course, this is only guessing. There may be other reasons for the zero value!)

Theoretical (Hujer & Schneider 1986, 1989) as well as empirical (Petersen 1991b; Petersen & Koput 1992) arguments have been put forward that in the case of grouped or aggregated duration data which have been rounded up to the next integer, a value of 0.5 should be subtracted from the durations, at least if not too much time dependence is present. On the assumption that a duration of zero means that dropping out of college occurred in the same month as entering college (and a duration of one that dropping out occurred in the month following the month of entrance, and so on), this would mean, on the contrary, that the durations were rounded down. Thus, a natural way of coding the data would be to add a value of 0.5. (Note that as long as we are employing Cox's partial likelihood model, we may change values as we like, provided that the order of durations, or arrival times, is not affected. Thus, we may either add a constant c from within the range 0 < c < 1 to the zero duration, or we may add any constant to all durations!)

However, in the following I will add a constant of 1 to all durations. This will make things easier later on when we split the episodes. In substantive terms, this is not really justified, but given that the time frame for the entire analysis is about 48 months, it cannot change the results in any decisive way, even though the "real" value of the durations may be exaggerated (very) slightly.
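For readers who want to inspect or prepare the data outside TDA, here is a minimal sketch of this recoding in Python (purely illustrative, not TDA syntax); it assumes that yam152_a.dat is whitespace-separated and that the raw duration is the second column, as in the definition c2+1 shown in the nvar() section below.

import pandas as pd

# Read the raw data file; no header line, columns separated by whitespace.
data = pd.read_csv("yam152_a.dat", sep=r"\s+", header=None)
dur_raw = data[1]                # second column (0-based index 1): raw duration in months

dur_plus_one = dur_raw + 1       # the recoding used here: add 1 to all durations
dur_plus_half = dur_raw + 0.5    # the alternative discussed above for grouped data

print((dur_raw == 0).sum(), "zero duration(s) before recoding")
print((dur_plus_one == 0).sum(), "zero duration(s) after recoding")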

Here, I want to present part of TDA's output for a Cox regression model (when invoked with y152_1a.cf) and explain the most important parts of this output.

> reading command file: y152_1a.cf

> nvar(...)
---------------------------------------------------------

Creating new variables. Current memory: 260624 bytes.

(The following section repeats the data definitions from the nvar(...) section.)

Idx Variable Label   T S PFmt Definition
  1 V1       ZERO    1 8  0.0 0
  2 V2       DUR     3 4  0.0 c2+1
  3 V3       DES     3 4  0.0 c3
  4 V4       SEX     3 4  0.0 c4
  5 V5       GRD     3 4  0.0 c5
  6 V6       PRT     3 4  0.0 c6
  7 V7       LAG     3 4  0.0 c7
  8 V8       MRG     3 4  0.0 c8
  9 V9       ST      3 4  0.0 c9
 10 V10      PRT*LAG 3 4  0.0 V6*V7

Creating a new data matrix.
Maximum number of cases: 1000

Using data file(s): yam152_a.dat
Free format. Separation character(s): default.

Reading data file: yam152_a.dat

Created a new data matrix.
Number of cases: 265
Number of variables: 10
Missing values in data file(s): none.
---------------------------------------------------------
End of creating new variables. Current memory: 296913 bytes.

> edef(...)
---------------------------------------------------------
Creating new single episode data. Max number of transitions: 100.
Definition: org=V1, des=V3, ts=V1, tf=V2

(An overview of censored and uncensored observations and their mean durations follows.)

SN Org Des  Episodes  Weighted  Mean Duration  TS Min  TF Max  Excl
 1   0   0       158    158.00          42.13    0.00   48.00     -
 1   0   1       107    107.00          15.84    0.00   46.00     -
Sum              265    265.00

Number of episodes: 265
Successfully created new episode data.

> rate(...)=1
---------------------------------------------------------
Transition rate models. Current memory: 297062 bytes.

Model: Cox (partial likelihood)

Maximum likelihood estimation.
Algorithm 5: Newton (I)

Number of model parameters: 5
Type of covariance matrix: 2
Maximum number of iterations: 20
Convergence criterion: 1
Tolerance for norm of final gradient: 1e-006
Mue of Armijo condition: 0.2
Minimum of step size value: 1e-010
Scaling factor: -1

(The previous section refers to default values of the partial (or maximum) likelihood estimation.)

Log-likelihood of exponential null model: -569.816
Changed scaling factor for log-likelihood: -0.001
Using default starting values.
Sorting episodes according to ending times.

Convergence reached in 6 iterations.
Number of function evaluations: 8 (8,8)

Maximum of log likelihood: -542.247
Norm of final gradient vector: 0.0001960793
Last absolute change of function value: 1.15129e-008
Last relative change in parameters: 0.001001282

(The previous two sections refer to the outcome of the estimation process. Convergence should have been reached, and the last three numbers shown should all be very close to zero, as is the case here.)

Idx SN Org Des MT Variable Label     Coeff   Error C/Error  Signif
  1  1   0   1  A V4       SEX      0.3252  0.2031  1.6015  0.8907
  2  1   0   1  A V5       GRD      0.2847  0.0854  3.3351  0.9991
  3  1   0   1  A V6       PRT      1.4618  0.3327  4.3935  1.0000
  4  1   0   1  A V7       LAG      0.1269  0.0227  5.5823  1.0000
  5  1   0   1  A V10      PRT*LAG -0.0862  0.0416 -2.0704  0.9616

(Note that the level of significance in the last column is given as 1-p.)

Log likelihood (starting values): -567.0330
Log likelihood (final estimates): -542.2469

The output from the duration data module starts with the information that single episode data were used and provides information about the state and time variables. After that, a table lists all the transitions observed (including the censored cases, that is, "transitions" from 0 to 0) and their respective mean, minimum, and maximum durations. In our example, 158 censored cases and 107 transitions from "0" to "1" - that is, college dropouts - were observed. Next, the type of model is displayed, together with a lot of information pertaining to the estimation algorithm used and related issues (see chapter 5.6 of the TDA Manual on Maximum Likelihood Estimation). We will skip the technical issues here, but a few things may be pointed out. First, the estimation algorithm should have converged, which is the case in our example. (In most instances, if estimation has not converged after the default number of 20 iterations, something is wrong either with your data or your model, or both. However, this is only a rule of thumb which may not be appropriate in some special cases.) Also, the norm of the final gradient vector, the last absolute change of the function value, and the last relative change in parameters should not deviate too much from zero.

Finally, the results of the model estimation are displayed. First, you find the parameter estimates (column "Coeff"), their standard errors, T-statistics (parameters divided by their respective standard errors), and the level of significance. Note that TDA displays the probability that the parameter is different from zero, that is, when we accept a significance level of 0.05 we have to look for values that are greater than 0.95. The values are, as usual, rounded (a displayed value of 1.0000 does not mean that the p-value is exactly zero). The parameter estimates are identical to those given by Yamaguchi, with the exception of the coefficient for sex, which is reported by Yamaguchi as 0.324 (see again his Table 6.4, p. 148, Model 1).
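The values in the significance column appear consistent with treating the ratio in the "C/Error" column as standard normally distributed. A small Python sketch (an illustration and cross-check, not part of TDA) reproduces them from the table above under this assumption:

from scipy.stats import norm

# Ratios of coefficient to standard error as printed in the "C/Error" column above.
c_error = {"SEX": 1.6015, "GRD": 3.3351, "PRT": 4.3935, "LAG": 5.5823, "PRT*LAG": -2.0704}

for name, t in c_error.items():
    p = 2 * norm.sf(abs(t))                     # two-sided p-value under the normal
    print(f"{name:8s} p = {p:.4f}  1 - p = {1 - p:.4f}")

# SEX, for instance, yields 1 - p of about 0.8907, which agrees with the value printed
# by TDA; small deviations can arise from the rounding of the coefficients and errors.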

The values in the last two lines are important for computing the likelihood ratio statistic

LR = 2 (LL1 - LL0)

with LL1 being the log likelihood of the present model (i.e. "log likelihood (final estimates)") and LL0 the log likelihood of a model with no covariates, a so-called null model. (Note that what is actually displayed is the log likelihood of the model with which the estimation started, which may differ from the null model when, for instance, you have provided starting values.) The LR statistic has a chi-square distribution with degrees of freedom equal to the number of parameters by which the two models differ (here, the five covariates). In our example, we get

LR = 2 (-542.25 - (-567.03)) = 49.56

which is, apart from a small rounding error, identical to the value given in Yamaguchi's Table 6.4, p. 148 (Model 1). This value may be compared to the critical value of the chi-square distribution with five degrees of freedom, which is 11.1 at the .05 significance level. The test thus says that the model at hand (with five parameters) can indeed explain significantly more of the "variation" in the dependent variable than a model with no information about covariates, i.e. a model assuming that the hazard rate is the same for all observations. (The test therefore has the same purpose as the overall F test statistic in an OLS regression or analysis of variance framework.)
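As a quick cross-check of this arithmetic outside TDA, here is a minimal Python sketch (again purely illustrative):

from scipy.stats import chi2

# Likelihood ratio test of the Cox model above against the model without covariates.
ll_start = -567.0330     # log likelihood (starting values); here this is the null model
ll_final = -542.2469     # log likelihood (final estimates)
df = 5                   # the two models differ by the five covariates

lr = 2 * (ll_final - ll_start)
print(f"LR = {lr:.2f}")                                        # about 49.57 (49.56 above used rounded inputs)
print(f"critical value (5% level) = {chi2.ppf(0.95, df):.2f}") # about 11.07
print(f"p-value = {chi2.sf(lr, df):.2e}")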

Note that when estimating parametric hazard rate models, an exponential null model often serves as a point of reference, since, for instance, a Weibull model with starting values already has some information built in concerning the so-called shape parameter. The log likelihood of this exponential null model is therefore displayed in the output a little above the message "Convergence reached in xx iterations". But it is not meaningful in the context of a Cox regression model.

I may add that the likelihood ratio test can be applied more generally to compare two nested models, for instance, your present model with another model from which one or several of the parameters of the present model have been omitted, or to which one or several other parameters have been added. In this case, you will compare the log likelihood values (final estimates) of these models.

Before proceeding to the analysis with time-dependent covariates, I want to explore some substantive aspects of the data. The list of covariates contains two dichotomous variables, SEX and PRT (part-time work), and two presumably continuous variables, LAG (the time elapsed between finishing high school and starting college) and GRD (the average grades obtained in high school, ranging from 1 [mostly A grades] to 5 [mostly C grades]). But are these two variables really continuous? With respect to the grades, one may argue that this variable should rather be treated as ordinal. My position with respect to this problem is very simple: create several dummy variables that represent the different grades in a suitable way (which is easy here since the variable GRD has only 5 categories, so we take one of them as the reference category and make dummies out of the other four) and look at the results. If the coefficients show that the dummy variables reflect the order of the original categories and if adjacent categories differ by about the same magnitude, we may treat the variable GRD as a continuous variable.
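To make the dummy coding concrete, here is a hedged sketch of that recoding in Python (not TDA command-file syntax); GRD is taken to be the fifth column of the data file, as in the nvar() section above, and the choice of GRD = 1 as reference category is merely illustrative.

import pandas as pd

data = pd.read_csv("yam152_a.dat", sep=r"\s+", header=None)
grd = data[4]                                   # fifth column (0-based index 4): GRD

# One dummy per category, leaving GRD = 1 as the reference category.
grd_dummies = pd.DataFrame({f"GRD{k}": (grd == k).astype(int) for k in (2, 3, 4, 5)})
print(grd_dummies.sum())                        # number of cases in each category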

Concerning the variable LAG, the problem is slightly different. We may argue that time is a continuous and interval-scaled variable. But looking at the data, we find that most individuals have a lag of 1, 2, 3, or 4, with very few exceptions, which in some cases are unusually high. Here, the strategy proposed with respect to the variable GRD - creating several dummy variables - won't work well, since by doing so we would have one reference category (LAG of, say, 0 to 4 or 0 to 6) and several dummy variables each of which would represent only a single case. Without going into detail, I want to propose the following: First, because of the very few outstanding values, it might be worth a try to change the scale of the variable in an appropriate way, for instance, by taking the logarithm. Second, since there are so few outliers, we might create a single dummy variable, taking the value of 0 for "lag in the normal range" (say, 0 to 6) and 1 for "unusually long lag" (7 and more). This may be justified by arguing that what probably counts is whether or not an individual goes to college soon after leaving high school, while the exact amount of time elapsed is less important.

Let's start with the variable LAG. In y152_1c.cf, the second strategy is applied, that is, LAG is treated as a dummy variable, taking the value of 1 if LAG is greater than 6 months (with a constant of 1 added to all durations). In y152_1d.cf, the logarithm of the variable LAG is taken instead. Note that a value of 0.01 is added to the variable LAG in order to avoid the error message resulting from trying to compute log(0). You will find that no matter how the variable LAG is coded, the results do not differ very much, except of course for the different magnitude of the coefficients. Therefore we may treat this variable in whichever way we prefer. But we shall see that this is different in more complex models.
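For completeness, here is a sketch of these two recodings of LAG, again written in Python for illustration rather than in TDA command-file syntax (LAG is assumed to be the seventh column of the data file, as in the nvar() section above):

import numpy as np
import pandas as pd

data = pd.read_csv("yam152_a.dat", sep=r"\s+", header=None)
lag = data[6]                          # seventh column (0-based index 6): LAG

lag_dummy = (lag > 6).astype(int)      # 1 = "unusually long lag" (more than 6 months)
lag_log = np.log(lag + 0.01)           # add 0.01 first so that log(0) cannot occur

print(lag_dummy.value_counts())
print(lag_log.describe())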

Next, we take a look at the variable GRD. In y152_1e.cf, four dummy variables are created from the variable GRD (in addition, LAG is also treated as a dummy). The results show, in my opinion, that it is justified to use this variable just as if it were an interval-scaled continuous variable.

Last update: 28 Jan 2000