piecewise constant exponential model

Wolfgang Ludwig-Mayerhofer's Introduction to TDA

Incorporating time-dependence through a piecewise constant exponential model

Normally, I don't like piecewise constant exponential models (henceforth PCE models). They have become very fashionable, and especially so in German sociology. A PCE model means that the time span during which the process under investigation is observed is split into several intervals, or pieces (hence the name "piecewise"), and for each part a different constant (or baseline hazard) is estimated. The influences of the covariates usually are assumed to be the same in all these intervals; however, models are also available in which these influences may be different in the different slices of the time cake (if I may dare say so).

The reason why I don't like PCE models is that it is very easy to estimate them (all you have to do is to tell TDA the cutting points of the intervals) and they require no thinking. I have rarely seen a justification for the cutting points that were chosen in a given analysis, nor a post-hoc explanation of any differences that may have been found. But here, in our example, I feel there is a justification, and in addition this gives me a possibility to raise an interesting point, namely, testing coefficients for equality.

What is the justification for a piecewise constant model in the case of the data at hand? Well, we know indeed that the baseline hazard is different in different periods; in the summer months when a college year ends, the drop-out rate is higher. In the preceding sections, we have modeled this by including dummies for the different months. But actually this was only a proxy. Students do not drop out because it is June or July; they drop out because each year of study starts in September or October and ends after ten months, and this is the time when they decide not to further pursue their studies. If a school year were to start in March or April, then drop-out would occur in January or February. It is not the month, it is the duration of one year of schooling that counts.

Therefore, it would be more meaningful to model a higher drop-out rate after, say, 9 to 11 months (for the first year), after 21 to 23 months (for the second year) and after 33 to 35 months (for the third year). And this is indeed what a PCE model can do. And it is superior even in another aspect. The models in the preceding sections assumed that the drop-out rate was the same in all months of May, June, July etc. But this need not necessarily be true. We may assume that drop-out is highest in the first year, because investments made until this point in time are not as high as after two or three years. Therefore, we may wish to test the assumption that drop-out rates in the second, third and fourth year are lower than those in the first year. I will explain below how this can be achieved.

To make things not too complicated, I neglect the marriage variable for the moment. This enables me to work with the original data set without episode splitting. To incorporate marriage, we have to use the split data set, a task I leave to the readers.

Model 1: A simple piecewise constant exponential model

To estimate a PCE model, all we have to do is to add a line in the rate section of the command file that indicates the intervals for which the different constants should be estimated. In addition, the number indicating the model to be estimated after the closing parenthesis has to be changed to "3". In our case the rate part looks like this:

rate(
    tp = 0,9,12,21,24,33,36,
    xa(0,1) = V4,V5,V6,V7,V10,
    )=3;

This means that TDA estimates different constants for the time periods from month 0 to 8 (to be more exact: less than 9), from 9 to 11, 12 to 20, 21 to 23, 24 to 32, 33 to 35, and from 36 until the end of the time period covered in our data.
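For readers who want to check which interval a given duration falls into, here is a minimal sketch in Python (not TDA syntax). The function name period_index is my own invention for illustration; the only thing taken from the command file is the list of cutting points, and I assume, as described above, that each interval includes its lower bound and excludes its upper bound.

```python
from bisect import bisect_right

# Cutting points copied from the tp= line of the rate command.
CUT_POINTS = [0, 9, 12, 21, 24, 33, 36]

def period_index(month):
    """Return the 1-based period number for a duration in months.

    Assumption: every interval is closed on the left and open on the
    right, and the last interval is open-ended.
    """
    if month < CUT_POINTS[0]:
        raise ValueError("duration lies before the first cutting point")
    # bisect_right counts how many cutting points are <= month,
    # which is exactly the 1-based number of the interval.
    return bisect_right(CUT_POINTS, month)

for month in (0, 8.9, 9, 11, 12, 35, 36, 47):
    print(month, "-> Period", period_index(month))
```

A duration of exactly 9 months is thus assigned to the second period, and everything from 36 months onwards to the seventh.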

TDA's output now contains a new section indicating the different intervals, the number of episodes that start in this interval, the number of episodes that end in this interval and the number of events in this interval. As I use the data without splitting, all starting times are found in the first interval from the beginning of the process until less than 9 months. Also, as we have few censored cases before the end of the four year observation period, the number of ending times is equal or almost equal to the number of events (that is, drop-outs) for all but the last interval.

Time period		Starting times	Ending times	Events
0.0000 -	9.0000	265	37	37
9.0000 -	12.0000	0	20	20
12.0000 -	21.0000	0	10	10
21.0000 -	24.0000	0	14	14
24.0000 -	33.0000	0	11	10
33.0000 -	36.0000	0	11	9
36.0000 -		0	162	7

Here are the results of the estimation:

Idx	SN	Org	Des	MT	Variable	Label	Coeff	Error	C/Error	Signif
1	1	0	1	A	Period-1		-4.9663	0.2693	-18.4421	1.0000
2	1	0	1	A	Period-2		-4.1928	0.2954	-14.1949	1.0000
3	1	0	1	A	Period-3		-5.9171	0.3693	-16.0212	1.0000
4	1	0	1	A	Period-4		-4.3895	0.3267	-13.4376	1.0000
5	1	0	1	A	Period-5		-5.7751	0.3678	-15.7035	1.0000
6	1	0	1	A	Period-6		-4.6721	0.3813	-12.2529	1.0000
7	1	0	1	A	Period-7		-5.5978	0.4190	-13.3585	1.0000
8	1	0	1	A	V4	SEX	0.3021	0.2015	1.4992	0.8662
9	1	0	1	A	V5	GRD	0.3155	0.0904	3.4899	0.9995
10	1	0	1	A	V6	PRT	1.2896	0.2826	4.5633	1.0000
11	1	0	1	A	V7	LAG	2.2065	0.5209	4.2358	1.0000
12	1	0	1	A	V10	PRT*LAG	-2.2642	0.8133	-2.7841	0.9946

Log likelihood (starting values): -573.2319
Log likelihood (final estimates): -531.3046

Obviously, the coefficients for "Period-1" to "Period-7" refer to the seven intervals defined above. We note that indeed the coefficients for "Period-2", "Period-4" and "Period-6", that is for the end of the school year, are much higher than those for the other intervals, as expected.
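To get a feeling for what these constants mean, one may translate them into rates. The following Python sketch assumes that the model is log-linear, that is, that the rate equals the exponential of the period constant plus the covariate effects, so that exp(constant) is the monthly baseline rate for a case with all covariates equal to zero. The coefficients are copied from the table above; everything else (the variable names, the choice of comparison) is only for illustration.

```python
import math

# Period constants copied from the Model 1 output above.
period_coeff = {
    1: -4.9663, 2: -4.1928, 3: -5.9171, 4: -4.3895,
    5: -5.7751, 6: -4.6721, 7: -5.5978,
}

# Assumption: log-linear rate, so exp(constant) is the baseline rate
# per month when all covariates are zero.
baseline = {p: math.exp(c) for p, c in period_coeff.items()}

for p, rate in baseline.items():
    print(f"Period-{p}: baseline rate {rate:.4f} per month")

# How much higher is the rate at the end of the first year (Period-2)
# than in the months that follow (Period-3)?
ratio_2_vs_3 = baseline[2] / baseline[3]
print(f"Period-2 / Period-3 rate ratio: {ratio_2_vs_3:.2f}")
```

Because the model is multiplicative on the rate scale, the ratio between two periods does not depend on the covariate values, as long as the covariate effects are the same in all periods.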

We also note, however, that the coefficients for "Period-2", "Period-4" and "Period-6" are largely equal, and so, it seems, are those for "Period-3", "Period-5" and "Period-7". But without a proper test, this is just guesswork. Luckily, with TDA such a test can be performed easily, because TDA permits us to estimate a model with the assumption that these coefficients are equal "built into" the model, as it were. If this model performs significantly worse, as measured by the likelihood ratio test, the assumption was not justified. On the other hand, if the likelihood does not decrease significantly, we have guessed right.

Model 2: A piecewise constant exponential model with constraints

In the language of TDA, to estimate a model where some parameters are equal means to put constraints on the model. Consequently, the subcommand is named con. This keyword has to be followed by an equation that expresses the desired constraint. One equation that expresses equality of two coefficients would state that the difference between these coefficients is equal to zero. To formulate the constraints, the coefficients must be termed bX, with X replaced by the consecutive number of the coefficient as exhibited in the output. For instance, the coefficient for "Period-5" has number 5 in the first column (termed Idx for "index"), and therefore is addressed as b5. If we were to address the coefficient for SEX, we would have to use the name b8 in the expression for the constraint. Therefore, the rate command may look like this:

rate(
    tp = 0,9,12,21,24,33,36,
    xa(0,1) = V4,V5,V6,V7,V10,
    con= b5 - b7 = 0,
    con= b5 - b3 = 0,
    con= b6 - b4 = 0,
    con= b6 - b2 = 0,
    )=3;

These constraints are repeated (for clarity's sake) in the output. This section looks as follows:

Checking constraints.
Con: 1 * b5 - 1 * b7 = 0
Con: 1 * b5 - 1 * b3 = 0
Con: 1 * b6 - 1 * b4 = 0
Con: 1 * b6 - 1 * b2 = 0
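
What these four constraints do to the seven period constants can be checked with a small Python sketch. It is not part of TDA; it merely reads the constraint strings as given in the command file, assumes that each one has the simple form "bX - bY = 0", and groups the coefficients that are forced to be equal.

```python
import re

# Constraints copied from the rate command above.
constraints = ["b5 - b7 = 0", "b5 - b3 = 0", "b6 - b4 = 0", "b6 - b2 = 0"]
N_PERIODS = 7

# Simple union-find: every coefficient starts out as its own group.
parent = {i: i for i in range(1, N_PERIODS + 1)}

def find(i):
    """Follow the parent links up to the representative of i's group."""
    while parent[i] != i:
        i = parent[i]
    return i

def union(i, j):
    """Merge the groups of i and j, keeping the smaller index as root."""
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[max(ri, rj)] = min(ri, rj)

for text in constraints:
    # Assumption: exactly two coefficient names per constraint.
    i, j = (int(n) for n in re.findall(r"b(\d+)", text))
    union(i, j)

groups = {}
for i in parent:
    groups.setdefault(find(i), []).append(i)

free_groups = sorted(groups.values())
print(free_groups)
print("free period constants:", len(free_groups))
```

Seven constants collapse into three groups (period 1 alone, periods 2, 4 and 6, and periods 3, 5 and 7), so the constrained model has four parameters fewer than the unconstrained one, one for each constraint.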

The results can be found in the following table:

Idx	SN	Org	Des	MT	Variable	Label	Coeff	Error	C/Error	Signif
1	1	0	1	A	Period-1		-4.9695	0.2694	-18.4478	1.0000
2	1	0	1	A	Period-2		-4.3771	0.2428	-18.0253	1.0000
3	1	0	1	A	Period-3		-5.7918	0.2687	-21.5582	1.0000
4	1	0	1	A	Period-4		-4.3771	0.2428	-18.0253	1.0000
5	1	0	1	A	Period-5		-5.7918	0.2687	-21.5582	1.0000
6	1	0	1	A	Period-6		-4.3771	0.2428	-18.0253	1.0000
7	1	0	1	A	Period-7		-5.7918	0.2687	-21.5582	1.0000
8	1	0	1	A	V4	SEX	0.3030	0.2016	1.5034	0.8673
9	1	0	1	A	V5	GRD	0.3167	0.0905	3.5009	0.9995
10	1	0	1	A	V6	PRT	1.2941	0.2825	4.5815	1.0000
11	1	0	1	A	V7	LAG	2.2202	0.5202	4.2679	1.0000
12	1	0	1	A	V10	PRT*LAG	-2.2990	0.8127	-2.8287	0.9953

Log likelihood (starting values): -573.2319
Log likelihood (final estimates): -532.2643

It can be seen that the difference between the log likelihood of the previous model and that of this model is less than 1, and therefore twice this difference (the test statistic for the likelihood ratio test) is less than 2. This would not be a significant difference even with one degree of freedom; but as we have four constraints, this model has four parameters fewer than the previous model, so the test has four degrees of freedom, and it would require a test statistic of 9.49 for the difference to be significant at the 5 percent level. Therefore, we may quite safely conclude that our assumption about the equality of the coefficients was justified.
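
The arithmetic of this test can be retraced with a few lines of Python. The two log likelihoods are copied from the outputs above; the helper chi2_sf_even_df is my own and uses the closed form of the chi-square tail probability that holds for an even number of degrees of freedom, so no statistics library is needed. Take it as a sketch for checking the numbers, not as something TDA does internally.

```python
import math

# Final log likelihoods copied from the two TDA outputs.
loglik_unconstrained = -531.3046   # Model 1
loglik_constrained = -532.2643     # Model 2 (four constraints)
n_constraints = 4

# Likelihood ratio test statistic: twice the difference.
lr_statistic = 2 * (loglik_unconstrained - loglik_constrained)

def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square variable with even df.

    Uses exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form only valid for positive even df")
    half = x / 2
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

p_value = chi2_sf_even_df(lr_statistic, n_constraints)
print(f"LR statistic: {lr_statistic:.4f}")
print(f"p-value with {n_constraints} df: {p_value:.3f}")

# Sanity check of the critical value quoted in the text.
print(f"tail probability at 9.49: {chi2_sf_even_df(9.49, 4):.4f}")
```

The test statistic stays far below 9.49, and the tail probability at 9.49 comes out at about 0.05, which is where that critical value comes from.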

In the next step, we might also test the assumption (which I have up to this point taken for granted) that the coefficient for "Period-1" is indeed different from the coefficients for periods 3, 5 and 7, for instance by introducing an additional constraint b5 - b1 = 0. As a model with this additional constraint exhibits a significantly worse fit (the log likelihood for this model being -537.46), we can see some justification for the hypothesis that drop-out during the first year is higher than during the second, third and fourth year.
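
Again the numbers can be checked in Python. Here the comparison is between the model with four constraints and the model with five, so the test has one degree of freedom; for that case the chi-square tail probability equals erfc(sqrt(x/2)), which Python's math module provides. The value 3.84 is the usual 5 percent critical value for one degree of freedom and is not taken from the TDA output.

```python
import math

# Log likelihoods as reported in the text.
loglik_four_constraints = -532.2643
loglik_five_constraints = -537.46   # with the extra constraint b5 - b1 = 0

lr_statistic = 2 * (loglik_four_constraints - loglik_five_constraints)

# Chi-square tail probability for exactly one degree of freedom.
p_value = math.erfc(math.sqrt(lr_statistic / 2))

CRITICAL_5_PERCENT_1DF = 3.84  # standard table value, one df

print(f"LR statistic: {lr_statistic:.2f}")
print(f"p-value with 1 df: {p_value:.4f}")
print("significant at 5 percent:", lr_statistic > CRITICAL_5_PERCENT_1DF)
```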

Last update: 04 Apr 2000