Reading Data

Stata data sets

To open an already existing Stata system file (with extension ".dta"), the appropriate command is

use "name-of-data-file"

If you have been working with another data set that is still in memory, you have to write:

use "name-of-data-file", clear

clear can also be used as a stand-alone command prior to use.
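
For instance, the following two lines should have the same effect as use with the clear option (a minimal sketch with the same placeholder file name):

* drop the data currently in memory, then load the new file
clear

use "name-of-data-file"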

Note the double quotes I have enclosed the file name with. These quotes may be omitted if the name of the data set does not contain any empty spaces (blanks). But since it has become possible to use file names old people like me would never have dreamt could be allowed on any computer system, it's probably good advice to always use double quotes, as they will never hurt. (Single quotation marks won't do the trick.)

Up to now, I have assumed that the data are in the default working directory, which as far as I know is the "Documents" subdirectory of your personal directory on a Windows PC. If the data set is to be found elsewhere, you may write, for instance

use "c:\mydirectory\mysubdirectory\name-of-data-file"

where you have to fill in your directory and data set name. Another way is to change to the pertinent directory first and then to "use" the data file:

cd "c:\mydirectory\mysubdirectory\"

use "name-of-data-file"
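
If you are not sure which directory Stata currently regards as the working directory, the pwd command will display it, and dir lists the files found there. A small sketch (the *.dta pattern is merely an illustration):

* show the current working directory
pwd

* list the Stata data sets in that directory
dir *.dta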

If you make frequent use of files from other directories, it may be helpful to define the paths to these directories as global macros.
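
A minimal sketch of this idea follows. The macro name datadir is just my choice for this example, and the path is the placeholder used above. I use forward slashes here, which Stata accepts on Windows as well; they avoid the trouble that arises when a backslash stands directly before a macro name.

* datadir is a hypothetical macro name; fill in your own path
global datadir "c:/mydirectory/mysubdirectory"

use "$datadir/name-of-data-file", clear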

Parts of a Stata data set

If you know from the outset that you need only parts of a data set, you may request Stata to limit the data to be loaded. "Limiting" the data may refer to the variables used and/or to the selection of a subsample of cases. Look at the following examples:

use var1 var17 var38 using "name-of-data-file"

will load only the three variables mentioned into your working memory.

use if id <= 1000 using "name-of-data-file"

will load only cases with a value less than or equal to 1000 in variable id.

Both types of command may be combined, such as in

use var1 var17 if id <= 1000 using "name-of-data-file"
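
Two related possibilities, sketched here with the same placeholder names: describe using shows the variables contained in a file without loading it (handy for finding out which variables to pick), and the in qualifier restricts loading to a range of observations, here the first 100 (an arbitrary number chosen for illustration).

* look at the contents of the file without loading it
describe using "name-of-data-file"

* load two variables for observations 1 to 100 only
use var1 var17 in 1/100 using "name-of-data-file", clear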

Data from other statistics software

As of version 16, Stata can import data sets that have been created by the SPSS or SAS packages. Actually, SAS transport files could be read from version 12 onwards; for SPSS, one or two user-written ado files have been available since about version 10.

Data from other statistical packages may be converted to Stata data sets with the help of Stat/Transfer or perhaps similar software I am not aware of. (There used to be a program called DBMS copy, but this has been defunct for quite some time afaik.)

IBM SPSS Statistics

Stata version 16 or higher

Stata's import command, introduced in version 12 (for Excel files, among others), has been extended to include SPSS files as of version 16. Typical uses may be:

import spss "name-of-data-file"

import spss var1 var17 var38 using "name-of-data-file" // reads only var1, var17 and var38

Note that name-of-data-file may include the name of a path to the directory in which the file is located. Note further that you may have to use the , clear option, depending on whether or not another data set is currently open.

By default, this command assumes that the SPSS data are in .sav format and were created by SPSS version 16 or higher. Data in .zsav format (SPSS version 21 or higher) can be imported by adding the option , zsav.
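
Put together, a compressed SPSS file might be read like this while replacing whatever data are currently in memory (a sketch; file and variable names are the same placeholders as above):

* read two variables from a .zsav file, discarding the data in memory
import spss var1 var17 using "name-of-data-file", zsav clear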

Earlier Stata versions

Two user-written procedures were (and still are) available for earlier versions: usespss can still be obtained via net from http://radyakin.org/transfer/usespss/beta (the last time I checked this was in mid-February 2024), and it seems to work with Stata 16, 64 bit version. Quite some time ago, possibly with Stata 12, I successfully used an earlier version of this procedure.

I never used importsav, the other user-written procedure. To obtain more information, use Stata's search function: search importsav. The procedure requires that you have a version of the R software installed on your computer. The help file says that it works with Stata version 10 or higher.

Note that if you have access to SPSS, you may save your file in Stata format and then use this version in Stata. However, it's a long time since I quit working with SPSS, so I can't reliably tell you anything about the current state of things.

SAS

Stata version 16 or higher

As of version 16, Stata offers three commands to import different types of SAS files:

import sas "name-of-data-file"

import sasxport5 "name-of-data-file"

import sasxport8 "name-of-data-file"

The first command can read SAS files created by version 7 or higher (.sas7bdat). The other two commands can import what is called a SAS Transport file, created by SAS XPORT version 5 or SAS XPORT version 8, respectively.

Note the extension of these commands to read only a selection of variables as, e.g., in

import sas var1 var15 var30 using "name-of-data-file"

Plus, do not forget to add the option , clear if another file is currently open.

Earlier Stata versions

With Stata versions 12 to 15, SAS "Transport" files (XPORT version 5) could be read via the command

import sasxport "name-of-data-file"

In still earlier versions of Stata such data could be read with the fdause command (the name of the command is derived from the fact that the US FDA requires this format). help fdause should provide the necessary information.

Data from spreadsheets and ASCII (text) files

Importing Excel™-files (Stata 12 and higher)

import excel "name-of-data-file", firstrow clear

will import the first sheet from file "name-of-data-file", assuming that the first row contains the variable names. If the data you wish to import are not in the first sheet, try adding the option sheet("name-of-sheet"). There are other options as well; e.g., you might restrict import to some rows and columns.
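
As a sketch of how these options may be combined (the sheet name is a placeholder and the cell range A1:F100 is made up for this example):

* read cells A1 to F100 from the named sheet; row 1 of the range holds the variable names
import excel "name-of-data-file", sheet("name-of-sheet") cellrange(A1:F100) firstrow clear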

Stata can also export to Excel™ files, but since I have had no reason to try this as yet, please find out for yourself how it works (help export).

For earlier versions of Stata (that cannot read Excel data as such), you may consider that Excel can create so-called ".csv" files, i.e. files with raw numbers / characters that are separated by delimiters (usually commas -- hence "csv" for "comma separated values"). Look at the following paragraphs.

Text / ASCII files

Text or ASCII files can represent data by plain numbers or characters. They can come in different shapes:

Data separated by delimiters

Often (as in the .csv files mentioned above) the numbers/characters are separated by delimiters; by default, this delimiter is a comma (other separators are possible, the main caveat being that it is a symbol that cannot be part of a data value). So, a list of six variables from two individuals may look like this:

1, m, 35, 3700, 80000, 30
2, f, 25, 900, 0, 21

With Stata version 13 or higher, such data can be read as follows:

import delimited "name-of-data-file", clear

See help import_delimited or the Stata handbook for additional options.
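
Two of these options deserve a sketch. The two example lines shown above contain no variable names, and some files use a semicolon instead of a comma as delimiter (the semicolon-delimited file is a hypothetical case added for illustration):

* no variable names in the file; Stata will call the variables v1, v2, ...
import delimited "name-of-data-file", varnames(nonames) clear

* variable names in the first row, semicolon as delimiter (hypothetical file)
import delimited "name-of-data-file", delimiters(";") varnames(1) clear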

For earlier versions of Stata, the insheet command provides similar functionality:

insheet using "name-of-data-file", c n clear

will read an ASCII file with comma separated values and names in the first line. Other options are:

t for tab delimited data
delim("X") for data delimited by X.

"X" may be exchanged by any other character.

The insheet command is no longer an official part of Stata (as of version 13), but it still works. Look for more information with help insheet.
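
As a sketch, a tab-delimited file and a semicolon-delimited file, both with variable names in the first line, might be read as follows (using the same abbreviated options as above; the semicolon is just an example of "X"):

* tab-delimited data with names in the first line
insheet using "name-of-data-file", t n clear

* data delimited by a semicolon, names in the first line
insheet using "name-of-data-file", delim(";") n clear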

ASCII files in fixed or free format

Data need not necessarily be separated by a delimiter; a blank may be enough. We say that data are in fixed format if the data are aligned in a way that variables are stacked precisely above/beneath each other, as in:

1 m 35 3700 80000 30
2 f 25  900     0 21

Such a format makes it easier to deal with missing values (they might even be represented simply by blanks, i.e. "nothing", even though I consider this bad practice).

In free format, the data may look just like those separated by a delimiter, but here the 'delimiter' is just a blank:

1 m 35 3700 80000 30
2 f 25 900 0 21

Here it is absolutely mandatory that each variable is represented in each single case by a value (number or character), at the least as a symbol for missing values. This format works because you inform Stata that there are (in our example) six variables per case (line), and thus Stata counts six consecutive data values as belonging to one case.

Current versions of Stata (i.e. version 16 or higher) use the infile and the infix command for such data. For more information, see help infile or help infix.
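
A minimal sketch for the six-variable example follows. The variable names (id, sex, age, income, wealth, hours) are made up for this purpose, and I have spelled out the storage type for every variable. The first command reads free format data. The second reads fixed format data, assuming the values occupy the columns given after each name; these column positions are an assumption that you would have to adapt to your own file. Note that both commands assume the extension .raw if the file name has none.

* free format: Stata counts six consecutive values as one case
infile int id str1 sex int age long income long wealth int hours using "name-of-data-file", clear

* fixed format: the numbers after each name are the (assumed) column positions
infix int id 1 str sex 3 int age 5-6 long income 8-11 long wealth 13-17 int hours 19-20 using "name-of-data-file", clear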

The older insheet command mentioned above may also be used for free format data (or fixed format data, provided each variable is represented by some data value for all cases). For more information, see help insheet.