String Variables

String variables, simply speaking, are variables that contain not just numbers, but also other characters (possibly mixed with numbers). For instance, the European Social Survey stores information about the country where respondents were surveyed in variable cntry, which contains strings like "DE", "ES", "LT" etc. Another term that is often used is "alphanumeric" variables, obviously referring to the "alphabet" and therefore to letters. But actually a string variable may contain characters such as "\" or "-" or ":" and so on: anything a keyboard may produce.

The first section will focus on the topic of transforming string into numeric variables. The second section will present some other stuff I found useful. But first of all I will issue a ...

Warning

As of version 14, Stata supports Unicode (UTF-8). If you don't know what that means: it's about how characters are coded digitally on your computer. (You've heard that computers use bits, bytes and such stuff to represent information. If you type and save an "a", you won't find an "a" on your hard disk, but some computer code that is transformed into an "a" on your screen or on the printer.) Unicode is much more universal than earlier codings such as ASCII or ANSI. It allows tons of characters to be represented, including many from the numerous languages that do not use the characters you are seeing here (derived from, and in most parts identical to those of, the Latin alphabet), but rather Cyrillic, Thai, or whatever.

I will not go into detail here, but just want to alert you that the new possibilities to deal with Unicode characters are not covered here. Note particularly that for some string functions (such as substr()) there are now equivalent functions that deal specifically with Unicode characters. These functions have names that are composites of u plus the conventional function, e.g., usubstr(). Note that as long as you restrict yourself to characters from the original ASCII code (something I have been teaching and preaching for decades), there is no need to worry. But otherwise, be aware that some things might work differently from Stata 14 onwards.

For instance, German umlauts are represented differently in Stata 13 and Stata 14, and this has consequences beyond the display of characters. As an example, the (German) word für is a string of length three in Stata 13, but the string length is four in Stata 14, because in UTF-8 the ü occupies two bytes. This also influences the results of functions such as strlen(). As strlen() refers to the memory used (and not the number of characters as they appear on the screen), the result of strlen("für") will be 4 in Stata 14 in contrast to 3 in Stata 13. The new function ustrlen("für"), in contrast, will yield 3. In other words, "ustring" functions refer to the number of characters as they appear to the human eye, not to the amount of memory needed.
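The difference between the byte-based and the Unicode-aware functions can be checked directly in the command window. The following is just a small sketch, assuming Stata 14 or newer (the u-functions do not exist in earlier versions):

```stata
* Byte-based length: counts the bytes used in memory (ü takes two bytes in UTF-8)
display strlen("für")

* Unicode-aware length: counts the characters as they appear on screen
display ustrlen("für")

* substr() cuts by bytes and may split the ü in the middle;
* usubstr() cuts by characters and returns the first two of them
display usubstr("für", 1, 2)
```

In line with the explanation above, the first command should show 4 and the second 3.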

From string to numeric variables

Even though Stata can handle string variables, it is clear in many respects that numeric variables are much preferred. Not least, most statistical procedures just do not accept string variables. Fortunately, Stata offers some easy ways for converting string to numeric variables (and vice versa).

"String only" variables

Sometimes, string variables represent properties. This might be "male" and "female" (plus "gay", "queer", or whatever you please, even if this is still very rare in social science data). Or it could be the country codes mentioned in the introduction, that is, "DE", "ES", "UK", "US" etc. If you need to transform this into a numeric variable, each category should be represented by a different number. Changing from string to numeric is easy, with encode being the command of your choice:

encode cntry, gen(cntrynum)

will convert cntry to a numeric variable, with the characters from the former string variable as value labels.

If for some reason you want to convert a numeric variable into a string variable, you may use the complementary command decode. Not surprisingly, Stata requires the numeric variable to be labeled; these labels will be used as strings to represent the different values.
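To see encode and decode at work, here is a minimal sketch on a made-up toy dataset. The country codes typed in below and the name cntrystr are my own illustration, not actual ESS data:

```stata
* Toy data: a string variable with country codes (illustrative values only)
clear
input str2 cntry
"DE"
"ES"
"LT"
"DE"
end

* String -> numeric; the strings become value labels of the new variable
encode cntry, generate(cntrynum)

* Show the underlying numbers and the value label that encode created
list cntry cntrynum, nolabel
label list cntrynum

* Numeric -> string again, using the value labels as strings
decode cntrynum, generate(cntrystr)
list
```

Note that encode numbers the categories 1, 2, 3, ... in alphabetical order of the strings, so the numeric values carry no meaning beyond distinguishing the categories.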

Numbers "disguised" as strings

A special case are variables where numeric values are stored as a string variable, including cases when the numeric values are stored together with some (irrelevant) characters. The command destring offers ways to convert such a variable to a numeric variable, leaving the original values unchanged (if they consist only of numbers) or removing any non-numeric characters.

The general form of the command is

destring varname(s), options

with one of the options generate() or replace being required. The most important options are:

generate(newvarnames): creates new variables
replace: replaces the old variable(s)
ignore("chars"): removes every character listed in "chars" before converting
force: changes all values that contain non-numeric characters not mentioned in ignore() to missing

Not surprisingly, Stata also offers the command tostring, which works the other way round.
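The options listed above can be tried out with a small made-up example. The variable names income, inc_num and inc_str as well as the values are hypothetical and serve only as an illustration:

```stata
* Toy data: numbers stored as strings, some with extra characters
clear
input str8 income
"1200"
"3,400"
"560x"
"n/a"
end

* Remove commas and the letter x; anything that is still non-numeric
* afterwards (here "n/a") is set to missing because of force
destring income, generate(inc_num) ignore(",x") force
list

* The other way round: numeric -> string
tostring inc_num, generate(inc_str)
describe
```

Without force, destring would refuse to convert the variable because of the "n/a" entry, so it is worth checking first which values will be lost.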

Data transformations for string variables

What follows is a quite heterogeneous collection. In particular, there are dozens of functions that refer to string variables, and I will cover only a very small and arbitrary selection.

split

split is a command that works on a string variable. Very obviously, it will split it into two or more parts; it will do so creating new variables while leaving the old variable unchanged. It does so if there is something in the original variable that separates those parts. By default, this separator is a space. Thus, splitting a variable containing "Joe Brady" will result in two new variables, one containing "Joe" and the other containing "Brady". Note that if there's a person named "Joe F. Brady" the result will be three variables.

split name: Variable "name" will be split into variables "name1", "name2" etc., provided it contains blanks, of course. The number of new variables will be equal to the largest number of blanks found in any observation plus 1.

split name, parse(,): Variable "name" will be split using commas instead of blanks as separators. You may indicate several separators; also, separators may consist of more than one character.

split lnum, destring: Parts of "lnum" that actually represent numbers will be turned into numeric variables, but only if there are numbers throughout for a given new variable. For instance, if variable "lnum" is "60 30" for one case and "50 ab" for another, new variable "lnum1" will be numeric (with values 60 and 50) while "lnum2" will still be a string variable (and while the value of this variable will be 30 for the first case, this will still be treated like a string).

split lnum, destring force: All the new variables will be numeric. Yet, whenever a non-numeric piece of information appears, the value will be missing. For instance, if "lnum" is "60 b30", the value of "lnum2" will be missing.

split lnum, destring force ignore(abcd): All the new variables will be numeric. Any characters "a", "b", "c" or "d" will be dropped. Thus, in the previous example "b30" will be turned into 30 and the result will indeed be a numeric variable.

split lnum, generate(ln): Variable "lnum" will be split into variables "ln1", "ln2" and so on.
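The variants of split described above can be tried on two tiny made-up datasets. The values are taken from the examples in the text; nothing here comes from real data:

```stata
* Example 1: splitting names at blanks
clear
input str20 name
"Joe Brady"
"Joe F. Brady"
end
split name
* name1, name2 and name3 are created; name3 is empty for "Joe Brady"
list

* Example 2: splitting and converting to numeric where possible
clear
input str10 lnum
"60 30"
"50 ab"
end
split lnum, destring
* lnum1 should now be numeric, lnum2 still a string variable
describe lnum1 lnum2
```

Running describe after split is a quick way to check which of the new variables actually ended up numeric.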

egen with ends()

This function will extract one "part" from a string variable (in contrast to the split command, which will create as many new variables as there are "parts"). What counts as a part is defined by a separator. Ironically, by default ends() will extract the first part (or head, as it is named in the Stata manual). All in all, there are three possibilities:

egen firstname = ends(name): This will extract anything that appears in "name" before the first occurrence of the separator, which by default is one blank space. This is equivalent to egen firstname = ends(name), head. If there is no separator, the whole string contained in "name" will re-appear in "firstname".

egen lastname = ends(name), last: This will extract anything that appears in "name" after the last occurrence of the separator. Again, if there is no separator, the entire content of "name" will be represented in "lastname".

egen endofname = ends(name), tail: This will extract anything that appears in "name" after the first occurrence of the separator. If there is no separator, the result will be an empty string.

The separator may be indicated by punct(,). Thus, egen firstname = ends(name), punct(,) will extract anything that appears in "name" before the first occurrence of a comma. Other separators are possible, including those consisting of several characters, but you cannot indicate several separators.
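The three possibilities, plus the punct() option, can be compared side by side in a short sketch. The names in the toy data and the variables name2 and surname are hypothetical and only there for illustration:

```stata
* Toy data: one name with blanks, one without; name2 uses a comma as separator
clear
input str20 name str20 name2
"Joe F. Brady" "Brady, Joe"
"Cher"         "Cher"
end

* Part before the first blank (head is the default)
egen firstname = ends(name)

* Part after the last blank
egen lastname = ends(name), last

* Everything after the first blank; empty if there is no blank
egen endofname = ends(name), tail

* Use a comma instead of a blank as the separator
egen surname = ends(name2), punct(,)

list
```

For "Joe F. Brady" this should give "Joe", "Brady" and "F. Brady", while for "Cher" the first two variables repeat the whole string and endofname stays empty, as described above.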

strmatch()

This may be used together with generate/replace. It looks for a "word", or more exactly, a sequence of characters, as in:

gen newvar = strmatch(oldvar, "somecharacters")

Variable "newvar" will have a value of 1 if "oldvar" consists exactly of the sequence "somecharacters", and 0 otherwise. However, if you are just looking for "somecharacters" to appear somewhere in "oldvar", you may use

gen newvar = strmatch(oldvar, "*somecharacters*")

which will look for "somecharacters" to appear anywhere. Writing

gen newvar = strmatch(oldvar, "*somecharacters")

will make Stata look for "somecharacters" to appear at the end of "oldvar", no matter how many characters precede "somecharacters". Of course, "somecharacters*" may be used as well.
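The four patterns can be compared on a small made-up example. The strings and the names of the new variables are mine and purely illustrative:

```stata
* Toy data (illustrative values only)
clear
input str20 oldvar
"apple"
"pineapple"
"apple pie"
end

* 1 only if the whole string is "apple"
gen exact = strmatch(oldvar, "apple")

* 1 if "apple" appears anywhere
gen anywhere = strmatch(oldvar, "*apple*")

* 1 if the string ends with "apple"
gen atend = strmatch(oldvar, "*apple")

* 1 if the string starts with "apple"
gen atstart = strmatch(oldvar, "apple*")

list
```

Since * also matches zero characters, "apple" itself satisfies all four patterns, whereas "pineapple" matches only the second and third and "apple pie" only the second and fourth.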