Episode Splitting

Wolfgang Ludwig-Mayerhofer's Introduction to TDA

Incorporating time-dependent covariates by episode splitting

One important way of dealing with time-dependent covariates is the method of episode splitting. Here, every time a covariate changes its value during an episode, the episode is split up at that point in time, yielding two new episodes. If the covariate changes its value several times, the episode is split up several times accordingly. This, of course, is also the case with the month variables in our present example. Depending on the number of splits, the resulting data sets may be very large, but in spite of this, computation will be quite fast. In addition, we now may use parametric hazard rate models, be it for continuous or for discrete variables.

What is episode splitting about? Consider case # 53 in the data set. This person entered college in month 9 (meaning September 1980) and married in month 43 (i.e. July 1983). In other words, case # 53 married after 34 months spent in college. The entire time spent in college for this case is 41 months.

To account for this change in marital status, the data for case # 53 (and for all other cases that marry while in college) will be split into two parts: one part will cover the time from entering college until marriage, and the other will cover the time from marriage until the end of the observation period for this case. That is, after splitting we will have two rows in the data set, one from the start until marriage and one for the rest of the time. The important thing is that we can now incorporate a dummy variable for the status of being married, with value 0 (not married) for the first period and value 1 (married) for the second period.

There are not many cases in the data set in which episode splitting is necessary. A few cases married prior to entering college; these have a value of 1 in the dummy variable for marriage status from the outset (if they were to divorce, one might split the respective episode and reset the variable to 0, but no information seems to be available concerning this). Most of the cases never marry at all while in college, and again no splitting is necessary, as they will have a value of 0 in the marriage status variable all of the time. But those few cases who marry while in college will be split into two episodes. Each of these two episodes (per case) will look like a complete "case" inasmuch as it will carry all the necessary variables. However, there is one important difference: If an episode has been split, the first part of the former episode, i.e. the new episode referring to the period prior to the change in the time-dependent covariate, has to be considered as "censored", as the event (if any) will occur only in the last sub-episode (if the event had happened before the change in the time-dependent covariate, there would have been no splitting). In other words, there has to be a new des variable; but TDA will provide for this automatically.

Note that this procedure, even though it looks as if we had increased the number of cases in the data, does not invalidate the statistical inference. This is because the real "units" of event history analysis are not individuals, but "individuals * time". And the time span covered by case # 53 does not change at all by episode splitting.
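The logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not TDA's implementation: the function name `split_episode`, the tuple layout and the numbers are assumptions chosen for the example, and the censoring code 0 follows the coding used in this data set.

```python
def split_episode(ts, tf, des, cut):
    """Split one episode (ts, tf, des) at time `cut`.

    Returns a list of (ts, tf, des) tuples. The episode is split only if
    the cut point lies strictly inside it; the first part is then treated
    as censored (destination 0), and only the last part keeps `des`.
    """
    if ts < cut < tf:
        return [(ts, cut, 0), (cut, tf, des)]
    return [(ts, tf, des)]


# Illustrative episode: 40 months, ending with an event, covariate change at 20.
parts = split_episode(0, 40, 1, 20)
print(parts)

# The total time at risk and the number of events are unchanged by splitting.
print(sum(tf - ts for ts, tf, _ in parts), sum(des for _, _, des in parts))

# A change before the start (negative time) or after the end causes no split.
print(split_episode(0, 9, 1, -2), split_episode(0, 9, 1, 99))
```

The second `print` shows why inference is not distorted: the two rows together still contribute 40 months of exposure and exactly one event.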

I shall proceed in two steps: First, episode splitting is applied to the marriage variable. In a second step, the data set will be split by months. Actually, the first step is unnecessary in the light of the second; but splitting by one or a few single dates is something that occurs very frequently, while the second step is quite special. On the other hand, splitting by months is not really necessary here, as in my opinion the most appropriate way to deal with these data is a piecewise exponential model, which will be explained later on. Therefore, the exposition here is mainly for didactic reasons.

Step 1: Splitting by time of marriage only

As can be seen from my short introduction to this topic, episode splitting involves two things: First, telling TDA at which point of time to split episodes (if splitting is necessary at all), and second, creating the variable the value of which changes at the time the episode is split.

In our example, we have first to prepare the variable concerning time of marriage appropriately. This is because time of marriage is represented in the data in relationship to calendar time, whereas time until drop-out (or until censoring) is measured in "process time" (the time measuring the process under observation, i.e. being in college). But for the episode splitting, time of marriage must be in the same "time frame", as it were, as process time. If, for instance, a marriage occurs 10 months after a student has entered college, the episode must be split after 10 months.

Both variables, time of marriage and process time, can be related to each other via the variable "starting month" which represents calendar time (as has been alluded to above). Note that the definition of this new variable for time of marriage has to be accomplished within the nvar ( ); command. Therefore, we will include one subcommand like this:

nvar (
     .
     .
     .
     V11 (Marriage) = if eq(V8,99) then 99 else V8-V9,
     );

The additional variable V11 assumes the value of 99 when an individual has never married (i.e., if V8 is equal to 99). Otherwise, the timing of marriage, related to process time, is computed by subtracting the starting time, V9. Note that if someone has married before entering college, which indeed is the case in a few instances, V11 will have a negative value, meaning precisely what it is supposed to mean: marriage has occurred before the process under investigation started.
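The same computation can be written as a small Python function, as a sketch for checking the arithmetic by hand. The function name `marriage_in_process_time` is hypothetical; the (V8, V9) pairs are the values listed for these cases in the data excerpt of this section.

```python
NEVER_MARRIED = 99  # code used in V8 (and kept in V11) for "never married"


def marriage_in_process_time(v8, v9):
    """Mirror of: V11 = if eq(V8,99) then 99 else V8-V9."""
    return NEVER_MARRIED if v8 == NEVER_MARRIED else v8 - v9


# case number -> (V8 = month of marriage, V9 = starting month)
cases = {1: (99, 9), 8: (33, 8), 35: (8, 10), 53: (43, 9), 68: (2, 6)}

for case, (v8, v9) in cases.items():
    print(case, marriage_in_process_time(v8, v9))
```

Cases # 35 and # 68 yield negative values (marriage before entering college), while case # 53 yields 34, the point at which its episode will be split.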

The decisive things happen in the edef ( ); section. Remember that we need to do two things: Split our data (when appropriate) and define a variable indicating the change in marriage status. The former is accomplished (very appropriately) with the split ..., subcommand, whereas the latter needs no specific subcommand and can be computed like any other variable. Or ... almost so, as usually we will need a reference to the process time, something that cannot be included normally in definitions of variables.

Even though the split ..., subcommand has to be the last command in the edef ( ); section, I will start with it, as the creation of the time-dependent covariate can be understood only in relation to what happens when the episodes are split. Splitting is very easy indeed: All we have to do is to indicate the variable by which the episodes have to be split (or the variables, if splitting is to be accomplished concerning several variables). In our case, the split ..., subcommand looks like this:

edef(
     .
     .
     .
     split = V11,
     );

What exactly is happening now? Whenever TDA finds a case in which V11 has a value greater than the starting time (in our case, 0 for all cases) and less than the ending (or failure) time (this time varies from case to case), the respective episode is split at the value that is indicated by V11.

Now to the definition of the time-dependent covariate, in our case a dummy variable with values 0 (not married) and 1 (married). Note that "time-dependent" does not mean that the value of the variable has to change in all cases; it just means that it may change (and of course the whole thing is meaningful only if at least some changes occur). How can this variable be defined?

Let's take our example case # 53 from above. This case will now have two rows in the data set, one with starting time 0 and ending time 34, and one with starting time 34 and ending time 42. As the variable measuring time of marriage in process time, V11, also has the value 34, we might say that the time-dependent covariate should have the value 1 for all episodes which have a starting time equal to the value of V11. But we must not forget that some cases will have negative times of marriage. Therefore, we should define the time-dependent variable as taking a value of 1 if V11 is less than or equal to the starting time of an episode. You can refer to the starting time by the keyword ts (as well as to the ending, or failure, time by tf).

Thus, the entire edef ( ); section will look like this:

edef(
     ts = V1,
     tf = V2,
     org = V1,
     des = V3,
     V12(Married) = le(V11,ts),
     split = V11,
     );
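To see what the definition of V12 yields for individual sub-episodes, the comparison can be mimicked in Python. This is a sketch only; the function name `married` is hypothetical, and the values of V11 and ts are those of the cases discussed in this section.

```python
def married(v11, ts):
    """Mirror of le(V11, ts): 1 if marriage time <= episode start, else 0."""
    return 1 if v11 <= ts else 0


# (case, V11, ts) for each (sub-)episode
episodes = [
    (1, 99, 0),    # never married: 99 is never <= ts
    (8, 25, 0),    # married after leaving college, no split
    (35, -2, 0),   # married before entering college
    (53, 34, 0),   # first sub-episode, not yet married
    (53, 34, 34),  # second sub-episode, starts at marriage
]

for case, v11, ts in episodes:
    print(case, ts, married(v11, ts))
```

Note that the code 99 for "never married" yields 0 only as long as no episode starts at month 99 or later; this holds here, where process time does not appear to exceed 48 months, but it is an assumption worth checking in other data.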

With these data, you may immediately perform an analysis, an example of which you will find in the file y152_es1.cf for an exponential model. However, in the case of the Cox regression model (and the Kaplan-Meier estimator), the data must first be written to a new file. In any case, beginners are strongly advised to write their split data to a new file in order to check whether they have indeed achieved what they planned to do.

Writing a data set with episode data to an ASCII file is accomplished with the epdat ( ); command, which I will explain by an example:

epdat (
      v = V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,
      dtda = y152spl1.cf,
      ) = y152spl1.dat;

The v ..., subcommand obviously tells TDA which variables to write to the output set. However, TDA will always write some additional variables:

A variable named ID to identify cases. This variable is created by giving a consecutive number to each case. Thus, you will be able to see to which case the sub-episodes belong. (Of course, you may also retain your own id variable; but you cannot suppress the creation of TDA's additional variable.)
A variable named SN to identify the spell number (which may be important if individuals have more than one spell).
A variable named NSPL indicating how many sub-episodes have been created from each episode.
A variable named SPL giving the consecutive numbers of the sub-episodes (i.e. 1 for the first sub-episode, 2 for the second etc.).
A variable named ORG for the new origin state (which, however, in most cases will be the same as the old variable indicating the origin).
A variable named DES for the new destination state. This variable is decisive. Imagine that an episode that lasts 40 months and ends with a dropout (i.e. has destination 1 in our example) is split at month 20. The first sub-episode will last from 0 to 20, the second one from 20 to 40. While the second sub-episode will have the destination state 1, this is not appropriate for the first sub-episode, which has to have a destination state of 0, as no change has occurred at the time this sub-episode ends (except for the time-dependent covariate, which is a different matter).
A variable named TS for the new starting times. Only the first sub-episode will start with 0 (or whatever the former starting time for the entire episode may have been), all other sub-episodes will start at the time the previous sub-episode ended.
A variable named TF for the new ending times. All sub-episodes save the last one for each case will end at that point in time when the episode was split.

The dtda subcommand requests TDA to produce a description of the resulting data set in the form of an nvar ( ); command written to the file indicated (here, y152spl1.cf). I will leave it to the reader to produce this file and to interpret it in the light of what I have explained above.

Finally, after the closing parenthesis and the "=" sign, the name of a file has to be given into which the data will be written. I will now reproduce a small part from this data set, with most of the cases omitted. The names of the variables in the first row are given here for clarification; they will not be written to the data file by TDA.

ID	SN	NSPL	SPL	ORG	DES	TS	TF	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12
1	1	1	1	0	0	0.00	42.00	0	42	0	1	1	0	3	99	9	0	99	0
2	1	1	1	0	1	0.00	9.00	0	9	1	0	3	1	3	99	9	3	99	0
.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
8	1	1	1	0	1	0.00	5.00	0	5	1	0	4	0	2	33	8	0	25	0
.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
35	1	1	1	0	1	0.00	9.00	0	9	1	0	0	0	4	8	10	0	-2	1
.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
53	1	2	1	0	0	0.00	34.00	0	42	0	1	2	0	3	43	9	0	34	0
53	1	2	2	0	0	34.00	42.00	0	42	0	1	2	0	3	43	9	0	34	1
.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
68	1	1	1	0	1	0.00	2.00	0	2	1	1	1	0	0	2	6	0	-4	1

Note that in case # 8, who married after leaving college (see V8 for time of marriage related to "calendar" time, and V11 for time of marriage related to process time), no episode splitting occurred and the variable V12 or "Married" has value zero. Conversely, case # 35 married prior to entering college (with V11 having a negative value), and thus again no splitting was necessary, while V12 ("Married") has value 1 (see also case # 68). The first split occurs in case # 53.
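Checking the split file by eye becomes tedious for larger data sets. The following Python sketch shows the kind of consistency checks one might automate once the file has been read in. It is not part of TDA; the names `Row` and `check_split_rows` are hypothetical, only a subset of the columns is used, and the check on DES assumes that 0 is the censoring code, as in this example.

```python
from collections import namedtuple
from itertools import groupby

Row = namedtuple("Row", "ID SN NSPL SPL DES TS TF V1 V2 V3")

# Rows for cases # 8, 35 and 53, copied from the excerpt above.
rows = [
    Row(8, 1, 1, 1, 1, 0.0, 5.0, 0, 5, 1),
    Row(35, 1, 1, 1, 1, 0.0, 9.0, 0, 9, 1),
    Row(53, 1, 2, 1, 0, 0.0, 34.0, 0, 42, 0),
    Row(53, 1, 2, 2, 0, 34.0, 42.0, 0, 42, 0),
]


def check_split_rows(rows):
    """Return a list of (key, message) for every inconsistency found.

    Rows are assumed to be sorted by ID, SN and SPL, as in the output file.
    """
    problems = []
    for key, group in groupby(rows, key=lambda r: (r.ID, r.SN)):
        parts = list(group)
        if len(parts) != parts[0].NSPL:
            problems.append((key, "number of rows differs from NSPL"))
        if parts[0].TS != parts[0].V1 or parts[-1].TF != parts[-1].V2:
            problems.append((key, "sub-episodes do not cover the original episode"))
        for i, row in enumerate(parts):
            if row.SPL != i + 1:
                problems.append((key, "SPL is not consecutive"))
            if i > 0 and row.TS != parts[i - 1].TF:
                problems.append((key, "gap or overlap between sub-episodes"))
            # Only the last sub-episode may carry the original destination.
            expected_des = row.V3 if i == len(parts) - 1 else 0
            if row.DES != expected_des:
                problems.append((key, "unexpected destination state"))
    return problems


print(check_split_rows(rows))
```

An empty list means that none of these particular checks failed; it does not prove that the split was specified as intended.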

It should be obvious that for the ensuing data analysis the new ORG, DES, TS and TF variables have to be used, and therefore the old variables (V1 to V3) are not really needed. However, as we often will write the new data to a file mainly in order to be able to check them, it will be very useful to keep the old variables in the new file. When you estimate a model immediately after splitting the data (i.e. with the split data in memory), TDA will automatically use the appropriate (new) variables.

Step 2: Splitting by month

The task of splitting the data set by months and creating the appropriate month dummy variables is less complicated than it might seem at first sight. Episode splitting by month means that we have to create several constants with values 1, 2, 3 and so forth up to 48. Then the data will be split by these constants, resulting in episodes from month 0 to 1, from 1 to 2, and so on. Together with the variable "starting time" (referring to the month the students started at college), we may then compute dummy variables for the different months.

The computation of the constants 1 to 48 is done in the nvar ( ); section; you can find excerpts here:

nvar(
dfile	= yam152_a.dat,	# data file
V1 (ZERO)	= 0,	# define a constant with value zero
V2 (DUR)	= c2 + 1,	# ending time = duration
V3 (DES)	= c3,	# destination state
V4 (SEX)	= c4,
V5 (GRD)	= c5,	# high school grades
V6 (PRT)	= c6,	# part-time student
V7 (LAG)	= c7,	# time lag high school - college
V8 (MRG)	= c8,	# time of marriage
V9 (STM)	= c9,	# starting month
V10 (PRT*LAG)	= v6 * v7,	# interaction effect
M1	= 1,	# first constant for monthly splitting
M2	= 2,	# second constant for monthly splitting
.
.
M48	= 48,	# 48th (and last) constant for monthly splitting
);

The time dependent covariates again have to be defined in the edef ( ); section, which in our case looks like this:

edef(
     ts = V1,
     tf = V2,
     org = V1,
     des = V3,
     V12(Married) = le(V11,ts),
     V13 (M1) = eq ( (ts + V9)%12,1 ),
     V14 (M2) = eq ( (ts + V9)%12,2 ),
     .
     .
     V23 (M12) = eq ( (ts + V9)%12,0 ),
     split = M1,,M48,
     );

The expression M1,,M48 in the split subcommand means that all variables from M1 to (and including) M48 are to be used for splitting the episodes.
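The interplay of monthly splitting and the month dummies can be traced with a short Python sketch. It is an illustration under assumptions, not TDA's procedure: the function names are hypothetical, ending times are taken to be whole months, and 0 is taken as the censoring code.

```python
def monthly_subepisodes(tf, des):
    """Split an episode running from 0 to tf (whole months) at 1, 2, ..., tf-1.

    Every sub-episode is censored (0) except the last, which keeps `des`.
    """
    return [(t, t + 1, des if t + 1 == tf else 0) for t in range(tf)]


def month_dummy_index(ts, start_month):
    """Which of M1..M12 equals 1 for a sub-episode starting at ts.

    Mirrors eq((ts + V9)%12, k) for k = 1..11 and eq((ts + V9)%12, 0) for M12.
    """
    m = (ts + start_month) % 12
    return 12 if m == 0 else m


# Case # 2 from the excerpt above: ending time 9, dropout (1), starting month 9.
for ts, tf, des in monthly_subepisodes(9, 1):
    print(ts, tf, des, "M%d" % month_dummy_index(ts, 9))
```

For this case the first sub-episode falls under M9, the fourth (ts = 3) under M12, and the fifth (ts = 4) under M1, since the remainder wraps around after twelve months.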

You will find a complete command file to achieve the splitting in y152_es5.cf. This command file will produce a data description file with the name y152_s5.cf, but you can find here a file y152_s5x.cf that also gives the appropriate edef ( ); and rate ( ); commands to estimate a Cox regression model with the split data set. The data description file produced by TDA unfortunately does not reproduce the variable labels; I have added the ones you find in y152_s5x.cf.

Note that in episode splitting, it does not seem appropriate to subtract a value of 1 from the time-dependent covariates as we have done when incorporating these directly in the Cox regression model.

Last update: 28 Jan 2000