Wolfgang Ludwig-Mayerhofer's Introduction to TDA

Reference to the data file and definition of variables

The central command for reading data and defining variables is the

nvar (    );

command. Typically, it consists of a reference to a data file and of a definition of the variables therein. I will discuss these items in turn.

Reading a data file

In the following it is assumed that your data are stored in an ASCII file. TDA system files may be created from this file, but ASCII files will be used throughout the analyses that follow (mainly because the data set is not very big). Note that TDA can also read SPSS, Stata, and dBASE files and, as a special and important feature, archive files, i.e. files that have been compressed in order to save disk space; this may be explained in a later version of this introduction.

Reading the raw data is achieved with the dfile subcommand which has the form

dfile = data.asc,

where you have to insert the name of your data file in the place of data.asc. In our example, the name of the data file is "yam152_a.dat" (note that TDA does not require – and will not accept – quotation marks around the name of the data file!). So you may write:

nvar(
     dfile=yam152_a.dat,
    );

This command, however, will do nothing, since you have not yet provided the information TDA needs to "interpret" the numbers in your file, i.e., you have not declared the variables in your data set.

Reading and creating data

Within the nvar command, you will tell TDA to read some (or all) data from your file. But you will also often create additional variables. TDA offers a great wealth of possibilities, but we shall cover only some very simple cases. More about this is found in Chapter 5.2 of the TDA Manual.

All data definitions have to use the subcommand

Vname = ....,

where Vname stands for the variable name that you will use from then on to refer to that variable. On the right-hand side of the equals sign the variable is defined, in many instances by reference to the data file, but in some instances also by using (and perhaps transforming) variables you have already read from your data. The first character of Vname has to be an upper case letter. Vname can consist of up to 16 characters, each of which may be a letter, a digit, the underscore (_), the @ sign, or the US dollar sign ($).

Elementary variables from your data file are variables to be used just as they stand. These are defined by reference to the column they occupy in the data set:

Vname1 = ci,
Vname2 = cj,
Vname3 = ck,
...

with c being mandatory and i, j, k and so forth being the appropriate column in the data set. Note that a given column, say, 5, refers to the respective variable in your data set (e.g., the 5th variable), not to the physical column that is identical to your cursor position when you look at your data with a DOS editor or a word processor.

Variables may be labeled by inserting a label in parentheses like this:

Vname (LABEL) = ....,

In the example data set, the first variable is a case number, the second one refers to a duration variable (duration until drop-out from college), the third to a "state" variable (whether, at the end of the observation period for each individual, the individual had indeed dropped out from college or not), and the fourth variable indicates the sex of the subjects under study. These variables (save the first one) are defined in the example command files as follows:

nvar(
     dfile=yam152_a.dat,
     V2 (DUR) = c2,
     V3 (DES) = c3,
     V4 (SEX) = c4,
    );

Note that the number in the variable name does not have to be identical to the column number. Of course you do not have to use numbers in your variable names at all; you might call these variables "DUR", "DES" and "SEX" from the outset (or "Gender", if you prefer, but this data set has only the attributes "boy" or "girl", probably measured by implicit reference to some biological properties of the subjects under study).

To define new variables from your data set - that is, variables that are not to be used "as is" - a wide range of mathematical and logical operators is available (see TDA manual, chapter 5.2).

First, you may use arithmetical operators like +, -, * and /. For instance, the models below need an interaction effect between two variables, V6 and V7, which is created as follows:

nvar(
     dfile=yam152_a.dat,
     V2 (DUR) = c2,
     V3 (DES) = c3,
     V4 (SEX) = c4,
     ...
     V6 (PRT) = c6,
     V7 (LAG) = c7,
     ...
     V10 (PRT*LAG) = V6 * V7,
    );

Some other arithmetical operators are

x ^ y    x to the power of y (where y must not be negative)
x % y    modulus operator

Mathematical operators provide more complex possibilities. Some of the most important are:

rnd (x) integer value of x, rounded to the nearest integer
abs (x) absolute value of x
sqrt (x) square root of x
log (x) natural logarithm of x
exp (x) exponential function (anti log) of x

These operators are not restricted to elementary variables; they may also be applied to more complex expressions, as in the following example:

V7 (L-LAG) = log(c7 + 0.01),

Constants are created by the command

Vname = c,

where "c" may be any number you like. As mentioned above, the example data set contains neither an origin state variable nor a starting time variable. Since both variables will have the value zero, we shall define a constant zero by the command:

V1 (ZERO) = 0,

Logical operators are operators like and, or, not, but also operators used in comparisons such as eq (equal to), gt (greater than) and so on. These are not logical in the strict sense, but the result of the comparison may be either TRUE (if, for instance, c7 is equal to 6) or FALSE (if otherwise). The results of all logical operations are translated into 1 (= true) and 0 (= false). For instance, the command

V7 (LAG-gt6) = gt(c7,6),

creates a variable V7 which has value 1 if the value observed in the 7th column of the data set is greater than 6, and value 0 otherwise.

Here are the logical operators:

not (E) results in 1 if expression E is false, and 0 otherwise
and (E1,E2) yields 1 if both E1 and E2 are true, and 0 otherwise. May also be written as E1 & E2
or (E1,E2) yields 1 if at least one of E1 and E2 is true, and 0 if both are false. May also be written as E1 | E2
eq (E1,E2) yields 1 if E1 equals E2, and 0 otherwise
ne (E1,E2) yields 1 if E1 is not equal to E2, and 0 otherwise
lt (E1,E2) yields 1 if E1 is less than E2, and 0 otherwise
le (E1,E2) yields 1 if E1 is less than or equal to E2, and 0 otherwise
gt (E1,E2) yields 1 if E1 is greater than E2, and 0 otherwise
ge (E1,E2) yields 1 if E1 is greater than or equal to E2, and 0 otherwise
if (E,E1,E2) if E is true, the resulting variable will have the value E1, otherwise it will assume value E2.
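
The operators can be nested. The following lines are only a sketch built from the operators listed above; the variable names V20 and V21 and their labels are made up for illustration and do not appear in the example command files:

```
nvar(
     dfile = yam152_a.dat,
     V20 (LAG1TO3) = and(ge(c7,1), le(c7,3)),  # 1 if c7 lies between 1 and 3, 0 otherwise
     V21 (LAGMAX6) = if(gt(c7,6), 6, c7),      # c7, but values above 6 are set to 6
    );
```

V20 is an ordinary 0-1 variable, while V21 shows that if can return any value, not only 0 or 1.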

Dummy variables may be created as follows:

Vname = cj[val1, val2, val3 ...],

with val1 etc. being any values of the variable found in column j (instead of referring to a column, as in this example, you may use any other expression). The resulting variable has value 1 if the expression takes one of the listed values, and 0 otherwise. This is very convenient if you want to refer to several values at once, since it spares you a lot of or statements. In our example data set, there is no instance in which this way of creating dummy variables is superior to using the eq operator, but I have used it nevertheless, for instance to create four dummy variables that represent 4 (out of the 5) categories of the variable GRD (high school grade) in column 5:

V30 (GRD2) = c5[2],
V31 (GRD3) = c5[3],
V32 (GRD4) = c5[4],
V33 (GRD5) = c5[5],
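
To see where the bracket notation pays off, consider a dummy that covers several values at once. This is a sketch only; V34 and V35 and their labels are hypothetical names, and the two lines are meant to define the same 0-1 variable:

```
V34 (GRD45)  = c5[4, 5],                # 1 if the grade is 4 or 5, 0 otherwise
V35 (GRD45B) = or(eq(c5,4), eq(c5,5)),  # the same dummy written with or and eq
```

With more than two values the second form requires nested or statements, while the first only gets a longer list.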

To complete this section, here's the variable definition part of a TDA command file that will be employed later. Most variable definitions are elementary, with two exceptions. The first is V2, which is defined as c2 plus a constant of 1 (the reasons for this will be explained below). The second is V10, which is constructed as the interaction of V6 and V7 by multiplying both variables. I use the "mathematical" multiplication operator here. If both PRT and LAG were defined as dummy variables, the "logical" multiplication operator (the ampersand character) might be used instead.

nvar(
     dfile = yam152_a.dat,      # data file
     V1 (ZERO) = 0,             # define a constant with value zero
     V2 (DUR) = c2 + 1,         # ending time = duration + 1
     V3 (DES) = c3,             # destination state
     V4 (SEX) = c4,
     V5 (GRD) = c5,             # high school grades
     V6 (PRT) = c6,             # part-time student
     V7 (LAG) = c7,             # time lag high school - college
     V8 (MRG) = c8,             # time of marriage
     V9 (STM) = c9,             # starting month
     V10 (PRT*LAG) = V6 * V7,   # interaction effect
    );

The size of the data file is hardly limited: the maximum number of variables is 5,000, and the maximum number of cases is 1,000,000. By default, however, TDA assumes that there are no more than 1,000 cases in the data file. If your data set is larger, this has to be indicated by the subcommand

noc = x,

with x being replaced by the number of cases in your data set. Note that x may be larger than the actual number of cases, but TDA will reserve the amount of memory necessary for processing x cases.
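
As an illustration, a data set with up to 5,000 cases might be declared as follows. Note the assumptions: the figure 5000 is arbitrary, and placing noc next to dfile inside nvar is my reading of "subcommand" above, so please check the TDA manual if in doubt.

```
nvar(
     dfile = yam152_a.dat,
     noc = 5000,        # reserve memory for up to 5000 cases (arbitrary figure)
     V2 (DUR) = c2,
     V3 (DES) = c3,
     V4 (SEX) = c4,
    );
```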

Some special features may be mentioned without going into much detail:

First, you may also define the "length" of your variables, or to be more precise, the amount of computer memory they occupy. This is achieved with a special code following the variable name enclosed within <>. For a 0-1 dummy variable, a single bit of memory is needed per case (indicated by Vname <0>=...), while a floating point number, which is assumed by default, will occupy 4 bytes. With very large data sets, definition of variable length will reduce the amount of memory needed dramatically when you have several dummy or integer variables. Note, however, that you are responsible for proper indication of variable length. If you define a variable that actually has integer values 0 to 20 as a single bit variable, you will get only 0 and 1 values. For details, see chapter 2.1 of the TDA manual.
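
A sketch of how the single-bit code might be applied to two of the grade dummies defined earlier is given below. Only the form Vname <0> = ... is taken from the paragraph above; the labels are left out on purpose, because this introduction does not show where a label goes relative to the <> code (see chapter 2.1 of the TDA manual).

```
V30 <0> = c5[2],   # 0-1 dummy, stored in a single bit per case
V31 <0> = c5[3],   # 0-1 dummy, stored in a single bit per case
```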

The format of the data file may be free-field or fixed-format. Fixed format is necessary if missing data are to be indicated by blanks (a rather bad practice, by the way, since with blanks you can never be sure whether the data are really missing or have just been forgotten). Otherwise, free-field format may be used. Note that TDA does not handle alphanumeric data. However, alphanumeric strings starting with a point "." or an asterisk "*" will automatically be converted to "-1" (this value may be changed with the mpnt or mstar command, respectively; both are explained in chapter 2.2 of the TDA manual).

By default, TDA assumes that the data are free-field. When data are fixed-format, the format must be indicated with the ffmt command (see again chapter 2.2 of the TDA manual). Note that in neither case does the entire data set have to be defined (or read into memory).

One nice feature of TDA is that the data file itself may also contain comments. Each comment line has to start with the # sign.

Last update: 25 Nov 1999