Data

Data consist of cases and variables that are set up in a matrix. That is, you have several cases which are the units of your analysis, and each case has one or several variables attached. Cases may be anything you like -- persons, countries, geographical units (cities, regions, lakes), or any other entity (trees, birds, observations from chemical experiments, temperature measurements ... really anything). For each case, there is at least one observation; usually, the same information is obtained for all cases (even though especially in the social sciences, "missing values", i.e. missing information due to lack of respondent information, errors in data processing etc., are pervasive).

In this chapter, it is assumed that you already have a data set at hand, and I will explain how you can get acquainted with your data. At the end, I will say a few words (very few words indeed) about entering your own data. See the section "Handling Data Files" (see navigation) about reading in, describing and re-arranging data.

Note that until version 10 you could not execute commands as long as the data window was open. In other words, you first had to close the data window before proceeding. As of version 11, this is no longer true; you may leave the data window open and go on with your work. Still, by default Stata will not display the data; you have to use the browse or the edit command, both described shortly.

Variable names

Most of the time, you will use an existing dataset, with variables already present. But you will usually create additional variables, and sometimes you will create a new dataset of your own. Therefore, it will be useful to be aware of Stata's conventions for naming variables. They are very simple:

  • Variable names must start with a letter or an underscore. (However, there are a number of built-in, or "system", variables that all start with an underscore; therefore, you had better avoid this for your own variables. Underscores at other positions will not create problems; in fact, I use them copiously for the sake of legibility.)
  • In addition to letters and the underscore, digits (i.e., numbers 0 to 9) are permitted. (You will often encounter data sets with variable names such as "v1" to "v870").
  • The names must not be longer than 32 characters.

If you speak a language that uses Umlaute or other special characters, you may be inclined to use them for naming variables. I'd advise strongly against this practice unless you are absolutely sure about the consequences of what you are doing.
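
The naming rules above can be tried out with a tiny sketch like the following. The variable names are made up for illustration, and I assume the lines are run from a do-file or pasted into the command line one by one. The prefix capture noisily lets Stata show its error message without stopping, and _rc holds the return code (zero means no error).

```stata
clear
set obs 3
* letters, digits and underscores in the name: accepted
generate hh_income_2020 = 1
* a short generic name of the "v1" kind: accepted
generate v1 = 2
* a name starting with a digit: Stata refuses to create the variable
capture noisily generate 2nd_wave = 3
display _rc
```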

Information about variables in a dataset

In Stata's default setting, a list of all variables in your current data set is displayed in the Variables panel. You can browse this panel and look at variable names and labels. The display of the latter will be rather short, unless you widen the panel at the cost of the Results window. The following commands will help you to find variables of interest and basic information about them.

Describe

describe income hhincome edu

will display information about data storage and display, as well as the full variable label (see entry about labels in section "Handling Data Files"), for the three variables named. describe alone will yield the information for all variables. The command may be abbreviated to d.
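
If you have no data at hand, a quick way to try this out is the small "auto" example dataset that is installed with Stata. This is only a sketch; the variable names price, mpg, rep78 and make belong to that example file, not to your own data.

```stata
sysuse auto, clear
* storage type, display format and variable label for three variables
describe price mpg rep78
* the abbreviated command works the same way
d make
* no variable list: information about every variable in the dataset
describe
```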

Codebook

Codebook is one way to obtain information about

  • the range of values present in a variable,
  • the label attached to a variable,
  • the missing values present in a variable.

It works simply like this:

codebook hhincome edu

codebook alone (with no further specification) will produce a codebook for all variables in the dataset.

Note the following about codebook's default behaviour:

  • If a variable has no more than 9 distinct nonmissing values, a simple frequency table will be produced that displays both the numerical values and the labels attached to these.
  • If a variable has more than 9 distinct nonmissing values, Stata's behaviour will depend on the presence of a value label for this variable:
    • If no value labels are present, Stata will display the 10th, 25th, 50th, 75th and the 90th percentile of the variable.
    • If value labels are present, Stata will display a selection of values, a selection that is arbitrary as far as I can see. (The reason for this seemingly erratic behaviour is that Stata 'thinks' that a variable with value labels is not continuous, a somewhat optimistic assumption given that continuous variables may contain missing values for which labels will be helpful.)
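
To see these defaults at work, you might try the following sketch with the "auto" example dataset shipped with Stata (my choice for illustration; the variable names are those of that file). As far as I know, rep78 has only a few distinct values, price has many and carries no value label, and foreign has a value label attached.

```stata
sysuse auto, clear
* few distinct values: a frequency table is shown
codebook rep78
* many distinct values, no value label: percentiles are shown
codebook price
* value label attached: values are shown together with their labels
codebook foreign
```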

The default behaviour may be changed with the help of option tabulate(#) or t(#). Let's assume that variable "class" has eleven categories (yes, some class schemes are that elaborate). To obtain a frequency table from codebook, type:

codebook class, t(11)

You may use a number greater than the number of categories present in this specific variable. For instance, you may decide that you wish to have frequency tables for all variables with no more than 20 (nonmissing) values. You might type

codebook, t(20)

to get a codebook for your entire dataset, and whenever Stata encounters a variable with 20 or fewer nonmissing categories, it will display a frequency table.
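
A small sketch of the tabulate() option, again using the "auto" example dataset as an assumption of mine. The variable mpg has more than 9 distinct values there, but (as far as I can tell) fewer than 25, so the limit of 25 is enough to obtain a frequency table.

```stata
sysuse auto, clear
* default limit of 9 is exceeded: percentiles instead of a table
codebook mpg
* raise the limit for this one variable: now a frequency table appears
codebook mpg, tabulate(25)
* the short form of the option, applied to the entire dataset
codebook, t(20)
```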

Lookfor

With large data sets, it is tedious to find a specific variable. Now, you will often remember parts of a variable name (or at least you think you do), but not the exact name. For instance, I worked with a data set which contained variables whose names contained the number 144; but I knew little about the rest. Entering

lookfor 144

yielded four variables; for each the information was displayed that I would have obtained with describe. Look also further down at section "variable lists".

But lookfor can do more than that. It looks also for (parts of) variable labels. So, if you don't know the name of a variable you are seeking but remember (or think you do) its label or parts of it, you can enter whatever bits you remember and hope that Stata will retrieve what you are looking for.
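
Both kinds of search can be illustrated with the "auto" example dataset that comes with Stata (an assumption for the sake of the example; the search terms only make sense for that file). The first search hits a variable name, the second one hits only a variable label.

```stata
sysuse auto, clear
* "ratio" is part of the variable name gear_ratio
lookfor ratio
* "mileage" appears only in the label of mpg, not in any name
lookfor mileage
* the names found by the last search are left behind in r(varlist)
display "`r(varlist)'"
```

The result in r(varlist) can be handy if you want to pass the variables found on to another command.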

Looking at data

In what follows I describe a few simple ways of looking at your data.

List

list var1 var17

will display the values of variables "var1" and "var17" in the output. If your data set is large (that is, if it has many cases), this is not a convenient way even if you restrict yourself to very few variables as in this example. If you want to look only at a small section of your cases (for instance, to check if a transformation has worked out the right way) you can use, e.g.:

l var1 var17 in 1/20

Here, the first 20 cases will be listed. The first value does not have to be unity, that is, you can start and end wherever you like, as in l var1 var17 in 337/402. If the second value is beyond the range of the number of observations, you will get an error message.
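
Here is a brief sketch of such ranges, based on the "auto" example dataset installed with Stata (my choice; it has 74 cases). Two things go beyond what I said above: negative numbers count from the end of the data, and the lowercase letter l stands for the last case. The prefix capture noisily shows the error message without stopping a do-file.

```stata
sysuse auto, clear
* first five cases
list make price in 1/5
* a range somewhere in the middle
list make price in 30/35
* the last three cases
list make price in -3/l
* end of the range lies beyond the last case: Stata reports an error
capture noisily list make price in 70/80
display _rc
```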

l

will list all variables (and all cases).

Finally,

list var1 var17 in 1/20, nol

will display the raw values of var1 and var17 instead of the value labels that are displayed by default (provided they have been defined, of course.)

Here is a list of a few other options you may find useful. More options can be found via help list.

clean suppresses separator and divider lines
noobs suppresses the observation numbers normally shown in the first column
compress compresses variable names where possible to reduce column width
sep[arator](#) draws a separator line every # rows
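
To get a feeling for these options, you may want to compare the output of the following commands. This is just a sketch using the "auto" example dataset shipped with Stata; the variable names are those of that file.

```stata
sysuse auto, clear
* plain output without lines and without observation numbers
list make price mpg in 1/10, clean noobs
* a separator line after every 5 rows
list make price mpg in 1/10, separator(5)
* narrower columns where possible
list make price mpg in 1/10, compress
* raw values of foreign instead of its value labels
list make foreign in 50/55, nolabel
```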

Browse

browse var1 var17

will open the data "browser". This looks like the data editor, but you cannot enter or change data. Note that with older versions (up to and including Stata 10) you have to close the browser before you can continue working with Stata.

Again, you may write, e.g.:

br var1 var17 in 1/20

Here, the first 20 cases will be displayed.

If labels are attached to the values of a variable, the data browser will display the labels by default. The "nol" (or "nolabel") option will cause Stata to display the raw values instead:

br var1 var17 in 1/20, nol

Of course, you may just enter "br" and will be able to look at the entire matrix.

Edit

edit var1 var17

will enable you to look at the data and change them. For instance, you might enter new data or change those that you can see. The variables that are not displayed will not be affected, of course. Note again that with older versions (up to and including Stata 10) you can only continue working after closing the data editor. If you have changed data, you will be asked whether or not changes are to be saved in the data matrix (note that saving here does not mean that changes are saved to the file on your disk).

The possibilities to restrict the range of cases described above apply as well, as does the "nol" option. "ed" alone will open the editor with the entire data matrix.

Abbreviating variables; variable and value lists

Abbreviating variable names

As the heading of this section indicates, variable names can be abbreviated as well. This means that if a character or a series of characters uniquely identifies a variable, it is sufficient to type this (series of) character(s) wherever a variable name is required. Thus, if there is a variable "trust" in your data and no other variable name begins with the letter t, you have to type only "t" wherever you might put "trust". If, in contrast, there is also a variable "tool", you have to type at least "tr" to identify the variable "trust".

If there are several variables with similar names and you wish to refer to all of these, look up the next section.
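
The following sketch shows abbreviations at work in the "auto" example dataset that comes with Stata (my choice for illustration). In that file, "pr" can only mean price, whereas "t" could mean trunk or turn. The command unab, which is not discussed elsewhere in this entry, expands an abbreviation and stores the full name in a local macro; I assume that Stata's default setting, which permits variable abbreviations, has not been changed.

```stata
sysuse auto, clear
* "pr" identifies price uniquely
summarize pr
* show what the abbreviation expands to
unab full : pr
display "`full'"
* "t" is ambiguous, so Stata refuses to guess and reports an error
capture noisily summarize t
display _rc
```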

Variable lists

If, within a command, you have to refer to a larger number of variables, this may become cumbersome. If several of these variables are adjacent in your data set (which happens not infrequently; think about frequency distributions for a list of items), you may refer to these as follows:

tab1 pk2301-pk2314

Of course, listing single variables and referring to a series of variables may be combined, as in

tab1 pk2301-pk2314 pk2317 pk2319-pk2323 pk2326 pk2328

It often happens that there is a series of variables that differ only by a suffix. Suppose you have measured "trust" in several entities; typically, the variables may be named "trust1", "trust2", "trust3" and so on, or "trustpolice", "trustpolitician", "trustbanks", "trustcourts", "trustparl" etc. In this case, you may refer to all of these variables using the wildcard:

tab1 trust*

The wildcard can also be placed at the beginning of the variable names. Assume you have measured the amount of trust in several entities and the main reason for trusting each entity. Assume that the first series of variables is named "atrustpolice", "atrustpolitician", and so on, and the second series "rtrustpolice", "rtrustpolitician", and so on.

tab1 *trust*

will tabulate all of these variables.

Finally, the question mark may be used to replace a single character of a variable name. For instance,

tab1 t?ust?

will tabulate every variable in your data that matches this pattern, such as "trust1", "tbust1", "trustx", "tcustl" etc.
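
If you are unsure which variables a list will pick up, you can check before running an analysis. The sketch below uses the "auto" example dataset shipped with Stata (an assumption of mine; the names price, mpg, rep78, trunk and turn belong to that file) and the command unab, which expands a variable list into a local macro.

```stata
sysuse auto, clear
* hyphen: all variables from price through rep78, in dataset order
describe price-rep78
* asterisk: any run of characters, here every name starting with t
summarize t*
* question mark: exactly one character each, so t??n fits turn but not trunk
summarize t??n
* show what a combined list expands to
unab expanded : price-rep78 t*
display "`expanded'"
```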

Value and case lists

In some procedures, you may refer to a series of values of a variable. For instance, you may wish to recode income to a smaller set of values. You may write

recode income (1/300 = 1) (301/600 = 2) (601/1000 = 3)

and so on. There is also an important qualifier, "in", that can be used with most commands and restricts them to a range of cases. For instance,

tab bild2 in 1/50

will tabulate the values of variable bild2 for the first 50 cases only.
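
Both ideas can be tried out with the "auto" example dataset installed with Stata. This is a sketch under my own assumptions: the cut-off points are arbitrary, price_cat is a made-up name, and I use the generate() option so that the recoded values go into a new variable and the original price remains untouched. The keywords min and max stand for the smallest and largest value present.

```stata
sysuse auto, clear
* value lists: three price brackets, stored in a new variable
recode price (min/5000 = 1) (5001/10000 = 2) (10001/max = 3), generate(price_cat)
* all cases
tabulate price_cat
* case list: the first 50 cases only
tabulate price_cat in 1/50
```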

Creating your own data set

Entering your own data is simple. Just type "edit" in the command line; the editor will open and you can start right away. If you have already opened another data set, you have to close this one first via the command "clear".

When creating (entering) your own data, you should be careful not to make your work too complicated. Therefore, a few rules should be taken into consideration.

A data file is a rectangular matrix in which usually each case consists of one row, and the variables form the columns of the matrix. Variables have to be named; the cases do not have "names", but of course there should be one or even more variables that "identify" each case (for instance, an identification number that refers to the questionnaire from which the data were entered). Variable names in Stata may have up to 32 characters now (i.e. version 9 or higher); in earlier versions they were restricted to 8 characters, and therefore it may be considered useful to restrict variable names to this limit. This will also make handling of variables easier.

For variable names, see the section at the beginning of this entry.
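
If you prefer typing to clicking, a very small data set can also be entered with the input command instead of the editor. The following is merely a sketch with invented names and values; id serves as the identifying variable mentioned above, and str10 declares name as a string variable of up to 10 characters.

```stata
clear
input id str10 name age
1 "Anna" 34
2 "Ben" 29
3 "Carla" 41
end
* check what has been entered
list
describe
```

Each line between input and end becomes one case, with the values in the order of the variable names given after input.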

The properties of variables can be changed via commands or via the "properties" window. This window is available both in the "data editor" view and in the main window. In the first case, click on a data column to show the respective variable in the "properties" window. In the latter case, click on the variable name in the "variables" window. Be sure to note the small lock underneath the heading "properties". This is a toggle which will provide (or forbid) access to the fields in the "properties" window.

Changing characteristics of variables

As of version 11, the new Variables Manager allows you to change variable names, labels, types, and formats interactively. Just type varmanage (no typo! There is no final r, and adding one will result in an error message) and a new window will open. Or use the menu (you will find the Variables Manager under Data).

Still, everything you can do with the Variables Manager can also be achieved via written commands, and this is often easier. You will find some of the pertinent commands in the entries of the section "Handling Data Files".
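
To give a first impression of what such commands look like, here is a short sketch based on the "auto" example dataset shipped with Stata. The new name and the label text are made up for illustration; rename, label variable and format are the commands I would typically use for names, labels and display formats.

```stata
sysuse auto, clear
* change the variable name
rename mpg miles_per_gallon
* change the variable label
label variable miles_per_gallon "Mileage in miles per gallon"
* change the display format: no decimals, commas as thousands separators
format price %9.0fc
* check the result
describe miles_per_gallon price
```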

© W. Ludwig-Mayerhofer, Stata Guide | Last update: 19 Jul 2020