Basic Univariate Statistics
Summarize
summarize var3
will display the number of observations for this variable, the arithmetic mean (commonly abbreviated as mean), the standard deviation, and the minimum and maximum values. The standard deviation is computed on the assumption that your data are a sample from a population; it is therefore an estimate of the population standard deviation, not simply the standard deviation of the data at hand. (In other words, when computing the variance, the denominator is not n, the number of cases in your dataset, but n-1.) Several variables can be listed, as in the following expanded example:
sum var1 var2 var3, detail
The option "detail" (abbreviated as "d") will cause Stata to report, in addition to the mean and the standard deviation, several further statistics: various percentiles, the four smallest and the four largest values, the variance, and finally skewness and kurtosis.
Actually, quite a number of measures of skewness and kurtosis have been proposed in the literature, and, particularly for kurtosis, the implementation in Stata's summarize is somewhat unfortunate. Dirk Enzmann has written an ado file, moments2, that allows you to pick the measure of your choice. Just type
ssc install moments2
to obtain this package. The help file will show you how to get the different measures.
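To try the summarize commands without data of your own, here is a minimal sketch that assumes the auto example dataset shipped with Stata; the variables price and mpg belong to that dataset, not to the examples above. It also illustrates the n-1 remark: summarize leaves its results in r(), so the standard deviation with denominator n can be derived from r(sd) and r(N).

```stata
* Load an example dataset shipped with Stata (an assumption; any data will do)
sysuse auto, clear

* Basic summary of two variables, then a detailed summary of one
summarize price mpg
summarize price, detail

* summarize stores its results in r(); list what is available
return list

* r(sd) uses the n-1 denominator; rescale it to get the n-denominator version
display "SD with denominator n-1: " r(sd)
display "SD with denominator n:   " r(sd) * sqrt((r(N) - 1) / r(N))
```

Note that when several variables are listed, r() holds the results for the last variable only.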
The command tabstat is yet another way to compute a number of sample statistics. It offers more flexibility in the choice of statistics displayed. For the time being, please refer to Stata's documentation (help tabstat).
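As a rough illustration of what tabstat can do, the following sketch again assumes the auto example dataset shipped with Stata (price, mpg, weight and foreign are variables of that dataset). The statistics chosen here are only one possible selection.

```stata
* Load an example dataset shipped with Stata (an assumption; any data will do)
sysuse auto, clear

* Chosen statistics for several variables, one column per statistic
tabstat price mpg weight, statistics(n mean sd p25 median p75) columns(statistics)

* The same statistics for one variable, separately for each group
tabstat price, statistics(n mean sd p25 median p75) by(foreign)
```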
Conditional description using tabulate
A related, but somewhat different possibility is to display summary statistics for a variable contingent on the values of another variable. Let's assume that variable "class" indicates the social class of the persons in your sample and "income" their income. Then
tabulate class, summarize(income)
will display the mean and the standard deviation of income, plus the number of observations, for each social class. As for the standard deviation, see my remark above on the n-1 denominator.
While there are a number of other possibilities to create an overview of means and standard deviations, a nice feature of tabulate is that it can display these statistics conditional on two variables combined. Thus,
tabulate class sex, summarize(income)
will display the mean and the standard deviation of income, plus the number of observations, for each sex within each class. More than two variables, however, are not permitted.
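The one-way and two-way forms can be tried with the following sketch, which assumes the auto example dataset shipped with Stata; foreign and rep78 merely stand in for class and sex, and price for income. The last line shows how the output can be reduced to the means alone.

```stata
* Load an example dataset shipped with Stata (an assumption; any data will do)
sysuse auto, clear

* Mean, SD and frequency of price for each category of foreign
tabulate foreign, summarize(price)

* Two-way version: cells defined by foreign and rep78
tabulate foreign rep78, summarize(price)

* Show only the means by suppressing SDs and frequencies
tabulate foreign rep78, summarize(price) nostandard nofreq
```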
The table command
Yet another way is to use the table command. It can be used to display up to five statistics per cell, with cells defined by the categories of one or two variables. Thus,
table school gender, content(freq mean math p10 math p50 math p90 math)
will display for the schools in your sample (row variable), separately for boys and girls (column variable), the number of cases (freq), the mean math score, and the 10th, 50th and 90th percentiles of the math score. Note
- that the option content may be abbreviated to c,
- that any number between 1 and 99 can be used to obtain the respective percentile,
- that other content is available, such as the interquartile range (iqr), minimum (min) and maximum (max), the sum (sum) or, if there are weights, the unweighted sum (rawsum), and finally the number of nonmissing observations (count or n).
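A caveat and a sketch: the contents() syntax described here is the one used up to Stata 16. To my knowledge, Stata 17 introduced a redesigned table command that uses statistic() options instead, with the older syntax remaining available under version control. The example below assumes Stata 17 or later and the auto example dataset shipped with Stata (foreign, rep78 and price are variables of that dataset); in older versions, drop the version prefix and the second table command.

```stata
* Load an example dataset shipped with Stata (an assumption; any data will do)
sysuse auto, clear

* Syntax as described above, run under version control in Stata 17 or later
version 16: table foreign rep78, contents(freq mean price p50 price)

* Rough equivalent in the redesigned table command (Stata 17 or later)
table (foreign) (rep78), statistic(frequency) statistic(mean price) statistic(p50 price)
```

The two tables will not look identical; the redesigned command, for instance, adds totals unless told otherwise.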
Hint: If you need a percentile that is not an integer, such as 2.5, you should try the _pctile command. This is a programmer's command, and hence the result must be requested from Stata with return list. Here's an example:
_pctile height, percentile(2.5 97.5)
return list
This will store the values of the 2.5th and the 97.5th percentile as the two scalars r(r1) and r(r2), and upon submitting the second command, Stata will answer:
r(r1) = XX
r(r2) = XX
where XX will be replaced by the value Stata has computed, of course.
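Because the results are stored rather than merely printed, they can be reused in later commands. The following sketch assumes the auto example dataset shipped with Stata and uses its variable price in place of height; the scalar names p_lo and p_hi are made up for this illustration.

```stata
* Load an example dataset shipped with Stata (an assumption; any data will do)
sysuse auto, clear

* Non-integer percentiles of price
_pctile price, percentiles(2.5 97.5)
return list

* Copy the results into scalars before another command overwrites r()
scalar p_lo = r(r1)
scalar p_hi = r(r2)
display "2.5th percentile:  " p_lo
display "97.5th percentile: " p_hi

* Count the observations outside the central 95 percent range
count if (price < p_lo | price > p_hi) & !missing(price)
```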
© W. Ludwig-Mayerhofer, Stata Guide | Last update: 19 Aug 2013