Data and Storage Types

Typically, we think of (quantitative) data as numbers. But this is not the whole story, in at least two ways. First, you may wish (or may have) to use data that contain "alphanumeric characters", or letters, as humans sometimes say. Second, "numbers" can mean different things. Both points will be explored in the first section.

Another thing to have in mind is the way numbers (or strings) are represented "internally", i.e. in the memory or on the hard disk of your computer. This is the topic of storage types, to which we will get back in the second section. Normally, this is stuff for advanced users, but there may be situations where even beginners have to know something about this issue.

Data types

The basic issue has already been pointed out in the introduction: In addition to numeric variables, your data set may contain variables that consist of other characters, particularly letters. Such variables are called string variables. Actually, they may contain numbers as well; they may even consist of numbers only. But these numbers cannot be used as numbers, that is, you may not perform any mathematical operations on them.
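The difference can be tried out with a minimal sketch. The variable names code and twice are made up for this illustration; the data set consists of a single invented observation.

```stata
* Hypothetical example data: one observation with a string variable "code"
clear
set obs 1
generate str4 code = "1200"

* Arithmetic on a string variable fails with a type mismatch error;
* "capture noisily" shows the error message without stopping a do-file
capture noisily generate twice = code * 2

* real() converts a string that looks like a number into a numeric value
generate twice = real(code) * 2
display twice[1]
```

The real() function returns a missing value for strings that cannot be read as a number, so it is worth checking the result before using it.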

Numeric variables

The most common use of numeric variables is to represent something that has been "measured". This may be income; here the numbers stand for "real" numbers, i.e., 1200 Euros (or dollars) typically will be represented as 1200. In the social sciences, we often use measures for opinions or attitudes; here, respondents state to which extent they agree or disagree with a series of statements (often called "items"). The extent of agreement likewise is represented numerically (with 1 standing for, say, "completely disagree" and 5 for "completely agree"). Finally, numbers often are used to represent different entities. For instance, in the ISSP data, there is a variable indicating the country from which respondents come. Here, "1" stands for "Australia", "2" for "West Germany", and so on.

With numeric variables, you can perform any computations, meaningful or not. For instance, Stata will compute the mean of the variable representing the country respondents come from, should you so desire. In case you should try to hand in a paper to your professor in which you report this "mean country" value, your academic career may be in severe danger. So ... it's always up to you to know what your data are about and how to handle them. This is what you will learn in a statistics course.

A special type of numeric variables represents time. Information about time may refer to dates as well as to time in terms of hours, minutes, seconds. There are specific ways for Stata to handle such information. Note that depending on the circumstances, using these specific features may not be absolutely necessary. For instance, you may have one variable representing the year and another representing the month something happened. These variables can be used like any other numeric variables. But with Stata's "datetime" features, some computations, as well as the representation of data to the human eye (and mind) may become easier.
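As a small sketch of what the datetime features can do, the following lines combine a year and a month variable into a single monthly date. The variable names year, month, mdate and months_since, as well as the two observations, are invented for this illustration.

```stata
* Hypothetical example data: year and month stored as ordinary numbers
clear
input year month
2015 8
2016 1
end

* ym() turns year and month into a monthly date
* (internally: the number of months since January 1960)
generate mdate = ym(year, month)

* The %tm display format makes the value readable, e.g. 2015m8
format mdate %tm
list

* Date arithmetic becomes simple: months elapsed since January 2015
generate months_since = mdate - ym(2015, 1)
list
```

Without the format command, mdate would be displayed as a plain number, which is correct but hard to read.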

String variables

In earlier versions, a string variable could contain up to 244 characters. I am not quite sure what "earlier" means exactly.

But note that as of version 13.0, Stata offers two types of string variables. The first type is the "short" string variable with up to 2045 characters. Typically, the type is denoted by "str#", with "#" being replaced by an appropriate number indicating the maximum length of the string. So, a "str26" variable may contain up to 26 characters.

The other type of string variables is called "strL". These variables may contain up to 2 billion characters! However, you can use this type also for strings with 0 to 2045 characters.
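Both string types can be requested explicitly when a variable is created. The following is only a sketch (Stata 13 or later is assumed); the variable names name and notes and their contents are made up.

```stata
* Hypothetical example: two empty observations to work with
clear
set obs 2

* A "short" string variable that holds up to 26 characters
generate str26 name = "Ludwig"

* A strL variable for (potentially) very long text
generate strL notes = "A long free-text comment could go here."

* describe lists the storage type of each variable
describe
```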

Storage types

Numbers (including datetime variables) can come in various sizes. There are variables that consist of numbers 0 and 1 only. Others, like income, go into tens of thousands; and others again may represent precise measurements with a large number of decimal places.

Typical storage types are:

Type     Bytes   Values
byte     1       integer values between -127 and 100
int      2       integer values between -32,767 and 32,740
long     4       integer values between -2,147,483,647 and 2,147,483,620
float    4       real numbers (i.e. numbers with decimal places) with about 7 digits of accuracy
double   8       real numbers (i.e. numbers with decimal places) with about 16 digits of accuracy
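The storage type can be stated explicitly in a generate command, and Stata stores the limits of the integer types as system values. The sketch below uses invented variable names and values. The upper limits look a little odd (100 rather than 127 for byte, for instance) because Stata reserves the remaining codes for missing values.

```stata
* Hypothetical example data: three observations
clear
set obs 3

* The storage type is given right after "generate"
generate byte   female     = mod(_n, 2)
generate int    birthyear  = 1960 + _n
generate long   population = 80000000 + _n
generate double ratio      = 1/3
describe

* Limits of the integer types as stored by Stata
display c(minbyte) " to " c(maxbyte)
display c(minint)  " to " c(maxint)
display c(minlong) " to " c(maxlong)
```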

Knowledge about storage type may be important particularly in two cases, which will be described in turn.

Size of data set

With modern computers, dataset size rarely is a problem. But as computer power increases, user demands increase as well. In the social sciences, you may encounter datasets that contain 1,000,000 cases or even more. Here, the size of your dataset may become a problem, and you may wish to think about the space each variable requires. Many variables may be represented by the byte or int types, which require only 1 or 2 bytes per value, compared to 4 bytes for the long and float types and 8 bytes for the double type. But often numeric variables are represented by the float type. Note particularly that if you create new variables from within Stata with the help of generate/replace, the new variable will be of the float type by default.

If you have loaded your data set into memory, there is a simple way to try reducing the size of your data set. Just type

compress

and Stata will check whether there are variables that may need less computer space than is provided for in the storage type and will change the storage type accordingly. You will be informed about this in the results window.
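The effect is easy to observe with the auto data set that is installed with Stata. The variable name heavy and the cut-off of 3000 are chosen arbitrarily for this sketch.

```stata
* Example data shipped with Stata
sysuse auto, clear

* A new 0/1 variable; generate uses the float type by default
generate heavy = weight > 3000
describe heavy

* compress picks the smallest type that holds all values without loss
compress
describe heavy
```

After compress, describe should report heavy as a byte variable. Alternatively, the type can be stated at the outset, as in "generate byte heavy = weight > 3000".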

Other possibilities to influence the size of your data set via the storage type concern the initial set-up of your data (something not yet covered here).

Relational operators

Ironically (at least to beginners), the storage types that allow for decimal places cannot represent most decimal values exactly. E.g., a value of 6.4 has no exact binary representation; in a float variable it is actually stored as roughly 6.4000001, whereas Stata evaluates the 6.4 you type in an expression with the higher precision of the double type. This may become a problem in expressions such as

generate test = 1 if var17 <= 6.4

as cases whose value was entered as 6.4, but is stored in a float variable as slightly more than 6.4 (roughly 6.4000001), will fail the condition and may thus be excluded against your will.

The solution provided by Stata is:

generate test = 1 if var17 <= float(6.4)

Here, the float() function rounds the constant 6.4 to float precision, i.e. to exactly the value that a float variable holds when 6.4 is stored in it (roughly 6.4000001). Accordingly, cases with a stored value of 6.4 will fulfil the condition of being less than or equal to 6.4.
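The problem can be reproduced with a single invented observation. The variable name var17 is taken from the example above; everything else is just a sketch.

```stata
* One observation with the value 6.4 stored in a float variable
clear
set obs 1
generate float var17 = 6.4

* Show what is actually stored (slightly more than 6.4)
display %20.16f var17[1]

* The comparison with the plain constant misses the observation ...
count if var17 <= 6.4

* ... whereas the comparison with float(6.4) finds it
count if var17 <= float(6.4)
count if var17 == float(6.4)
```

The first count reports 0 observations, the other two report 1. Storing the variable as double would be another way to avoid the mismatch, at the price of more memory.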

Change storage type

Change numeric or string types

The storage type of variables can be changed with the help of the recast command. Here, numeric variables can be changed to a different numeric type, and string variables to a different string type (i.e. different string length).

recast long income

will ask Stata to change the storage type of variable income to "long". Whether or not Stata will do what you want depends on the circumstances:

  • You can always change a numeric variable to a type that can hold all of its values exactly, e.g., from "byte" to "int" or from "float" to "double".
  • You can change a numeric variable to a less complex type (e.g., from "float" to "int")
    • if this causes no loss of precision (for instance, a float variable the values of which actually contain no decimal values can be changed to an integer type if the range of values permits, whereas in the presence of decimal values Stata will refuse to change the storage type)
    • or if option force is used (here Stata will change the variable type no matter whether or not this entails a loss of information).

In the same vein, note for string variables:

  • You can always change a string variable to a longer string type.
  • However, you can only change a string variable to a shorter string type
    • if this entails no loss of information
    • or if option force is used.
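These rules can be tried out with the auto data set that comes with Stata, in which price is stored as int, gear_ratio as float and make as str18. This is only a sketch; the exact wording of Stata's messages may differ between versions.

```stata
sysuse auto, clear

* Numeric: changing to a larger type is allowed
recast long price

* Changing back works, too, because every price fits into int
recast int price

* gear_ratio has decimal places: Stata leaves the variable unchanged
* and reports how many values would have been changed
recast int gear_ratio

* String: lengthening is allowed
recast str30 make

* Shortening would cut off names, so Stata leaves make unchanged;
* adding the force option would truncate the strings
recast str5 make

describe price gear_ratio make
```

Option force is deliberately not used here, as it would irreversibly change the data in memory.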

Change numeric to string type or vice versa

The tostring, destring, encode and decode commands are available for these more complex changes. Their basic functions can be described as follows:

  • destring is for string variables whose contents are actually numbers. It converts them to numeric variables. If the strings also contain nonnumeric characters (e.g., percentages together with a "%" sign), the ignore() option will remove these characters, leaving you with the numeric values only.
  • tostring will use the numeric values of a variable and will simply make strings of them. That is, a value of 10 will still look like 10 afterwards, only it will be a string. For "float" or "double" data with decimal places, tostring may refuse to work unless you specify a format() or the force option, as the conversion might lose information.
  • encode will change a string variable into a variable with numeric values. Each different string will be replaced by a numeric value; the previous (string) values will be attached as labels to the numeric values.
  • decode will work the other way round. It assumes that a numeric variable has labels, and it will use the labels to create the new (string) values of the variable.

destring and tostring require either a replace or a generate() option; encode and decode always create a new variable and therefore require generate(). For more detail, I have to refer you to the help system or the handbook. A teeny weeny little bit more can be found in a special entry in the section about data transformations.
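A minimal sketch of the four commands follows. The data and all variable names (pct, country, pct_num, country_id, country_str, id_str) are invented for this illustration.

```stata
* Hypothetical example data: percentages and country names as strings
clear
input str6 pct str10 country
"12%"  "Australia"
"7.5%" "Germany"
"30%"  "Australia"
end

* destring: strip the "%" sign and create a numeric variable
destring pct, generate(pct_num) ignore("%")

* encode: numeric codes (in alphabetical order) with the strings as labels
encode country, generate(country_id)

* decode: turn the value labels back into a string variable
decode country_id, generate(country_str)

* tostring: the numeric codes themselves as strings ("1", "2", ...)
tostring country_id, generate(id_str)

describe
list, nolabel
```

All four commands are used with generate() here, so that the original variables remain untouched and the results can be compared side by side.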

© W. Ludwig-Mayerhofer, Stata Guide | Last update: 02 Aug 2015