Keyword "by" for Subgroups

To perform an operation on subgroups rather than on the entire dataset, use the keyword by. It is most useful for data transformations, but it can also be used to run analyses by subgroup.

General

In general, by is used as a prefix. Writing

by country: some Stata command(s)

executes "some Stata command(s)" separately for each group defined by the variable "country".

Note, however, that this presupposes that the data are sorted by "country". If this is not the case, you may use the sort command prior to executing the command beginning with by. But you may also build it into the by prefix, as in:

by country, sort: some Stata command(s)

or, equivalently

bysort country: some Stata command(s)
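The three variants can be tried out with a short sketch. It assumes the "auto" example dataset that ships with Stata as a stand-in for your own data, with the variable "foreign" playing the role of "country"; neither appears in the text above.

```stata
* Illustration only: "auto" stands in for your data, "foreign" for "country".
sysuse auto, clear

* Deliberately put the data in an order other than by foreign
sort price

* Without sorting, the by prefix stops with the error "not sorted", r(5)
capture noisily by foreign: summarize mpg

* Variant 1: sort first, then use by
sort foreign
by foreign: summarize mpg

* Variant 2: build the sort into the prefix
by foreign, sort: summarize mpg

* Variant 3: the shorthand bysort
bysort foreign: summarize mpg
```

All three variants should produce the same two blocks of output, one for each value of "foreign".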

Data transformation

Some Stata commands that may be useful for data transformation do not relate to a single row of the data, but rather to the dataset in its entirety. For instance,

egen tinc = total(income)

will compute the sum of income over the entire dataset and will store the result in a new variable called tinc. Note that each case (=row) in the dataset will have the same value in this variable, to wit, the total of all incomes.

Of course, this is something you will not normally wish to do. However, a common situation is that you have data which were collected on households, and in each household all adult persons were interviewed concerning, e.g., their individual income. In such a situation, you may wish to compute the household income, i.e., the sum of all individual incomes in each household. This is easily accomplished, provided that a variable indicates to which household each person belongs:

bysort ID_hh: egen hhinc = total(income)
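The difference between the overall total and the household total can be seen in a small sketch. The six rows of data are invented purely for illustration; only the variable names are taken from the text.

```stata
* Invented toy data: three households with 2, 1, and 3 interviewed persons
clear
input ID_hh income
1 1000
1 1500
2 2000
3  500
3  700
3  300
end

* Total over the entire dataset: every row receives 6000
egen tinc = total(income)

* Total within each household: 2500, 2000, and 1500
bysort ID_hh: egen hhinc = total(income)

list, sepby(ID_hh)
```

Every person in a household carries the same value of "hhinc", just as every row in the dataset carries the same value of "tinc".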

A number of other egen functions may be used in this way, e.g., mean, median, sd, min, max, or pctile with the option p(#) (for percentiles other than the median). Note in particular:

bysort ID_hh: gen nhh = _n

will create a variable "nhh" that indicates the position of each case within "ID_hh"; that is, the first case will have the value of "1", the second of "2", etc. In contrast,

bysort ID_hh: gen Nhh = _N

will compute the overall number of cases within "ID_hh"; assuming that "ID_hh" refers to households, the new variable will indicate the household size.
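A minimal sketch of _n and _N within groups, alongside two of the egen functions mentioned above. The data are invented for illustration, and the variable names "meaninc", "maxinc", and "rank" are hypothetical. The last command uses the parenthesized form of bysort, which sorts within each household by "income" first, so that the position no longer depends on the order in which the cases happened to be stored.

```stata
* Invented toy data: three households with 2, 1, and 3 interviewed persons
clear
input ID_hh income
1 1000
1 1500
2 2000
3  500
3  700
3  300
end

* Position of each case within its household (1, 2, ...)
bysort ID_hh: gen nhh = _n

* Number of cases in each household, i.e., household size
bysort ID_hh: gen Nhh = _N

* Other egen functions work the same way
bysort ID_hh: egen meaninc = mean(income)
bysort ID_hh: egen maxinc = max(income)

* Position within household after sorting by income (lowest income = 1)
bysort ID_hh (income): gen rank = _n

list, sepby(ID_hh)
```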

Data analysis

Sometimes you may wish to repeat an analysis for subgroups in your data. This is easily done by putting "by ..." in front of the respective command. For instance, to repeat a regression analysis by country, supposing that the respective variable indeed is named "country", you simply have to write:

by country, sort: regress income education tenure

If the data are already sorted by country, the sort option used in this example is not necessary. However, I have noticed cases in which Stata did not recognize data as sorted (perhaps because there were "gaps" in the variable, i.e., the codes for country were not 1, 2, 3, etc. but rather 2, 3, 5, 8, etc.). As long as your dataset is not too large, using the sort option does no harm, as Stata is very fast. Alternatively, you can sort the data with the sort command before running your analyses.
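As a runnable sketch, the same pattern can be tried on the "auto" example dataset that ships with Stata. The dataset and the variables "foreign", "price", "mpg", and "weight" are stand-ins chosen for illustration, not the income data discussed above. The describe command reports the sort order Stata has on record for the dataset, which may help in checking whether Stata regards the data as sorted.

```stata
* Illustration only: "auto" stands in for your data, "foreign" for "country".
sysuse auto, clear

* Show which sort order, if any, Stata has recorded for the data
describe, short

* One regression per group; the sort is built into the prefix
bysort foreign: regress price mpg weight

* After the command, the data are recorded as sorted by foreign
describe, short
```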

Note that by does not work with all commands. Graph commands in particular often require the option over for comparing groups. Some procedures simply do not allow for a comparison of groups; here, you typically have to resort to the if clause and run the command once for each group.
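Both workarounds can be sketched briefly, again assuming the "auto" example dataset that ships with Stata rather than any data from this entry. The regress command is used in the second part only to show the if syntax; it does in fact permit the by prefix.

```stata
* Illustration only: "auto" stands in for your data.
sysuse auto, clear

* Option over(): compare groups within a single graph
graph bar (mean) price, over(foreign)

* if clause: run the command once for each group
regress price mpg weight if foreign == 0
regress price mpg weight if foreign == 1
```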

Whenever I am aware of a procedure not permitting the use of by, I will indicate this in the heading of the respective entry.