Basic Univariate Statistics

Summarize

summarize var3

will display the number of observations for this variable, the arithmetic mean (commonly abbreviated as mean), the standard deviation, and the minimum and maximum values. The standard deviation is calculated on the assumption that your data are a sample from a population; it is therefore an estimate of the population standard deviation and not simply the standard deviation of the data at hand. (In other words, when computing the variance, the denominator is not n, the number of cases in your dataset, but n-1.) Several variables can be listed, and the command name can be abbreviated (here to sum), as in the following expanded example:

sum var1 var2 var3, detail

The option "detail" (abbreviated as "d") will cause Stata to deliver, in addition to the mean and the S.D., several further statistics: various percentiles, the four smallest and the four largest values, the variance, and finally skewness and kurtosis.
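The remark about the n-1 denominator can be checked by hand. The following is only a sketch for a do-file: it assumes the auto dataset that ships with Stata as a stand-in for your own data, and the scalar and variable names (n, xbar, sd_s, ss, sqdev) are made up for this illustration. It relies on the results that summarize leaves behind in r().

```stata
* Assumption: the auto dataset shipped with Stata stands in for your data
sysuse auto, clear

* Keep the results of summarize in scalars (illustrative names)
quietly summarize price
scalar n    = r(N)
scalar xbar = r(mean)
scalar sd_s = r(sd)

* Sum of squared deviations from the mean
generate double sqdev = (price - scalar(xbar))^2
quietly summarize sqdev
scalar ss = r(sum)

* The first two lines should agree; the third is slightly smaller
display "S.D. from summarize:   " sd_s
display "denominator n-1:       " sqrt(ss/(n - 1))
display "denominator n:         " sqrt(ss/n)
```

Typing return list after any summarize command shows which results are available in r().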

Actually, quite a number of measures have been proposed in the literature for skewness and kurtosis, and particularly concerning kurtosis the implementation in Stata's summarize is somewhat unfortunate. Dirk Enzmann has written an ado file moments2 that allows you to pick the measure of your choice. Just type

ssc install moments2

to obtain this package. The help file will show you how to get the different measures.

The tabstat command is yet another way to compute a number of sample statistics. It offers more flexibility in the choice of statistics displayed. For the time being, please refer to Stata's documentation (type help tabstat).
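To give a first impression of tabstat, here is a minimal sketch. The variable names var1 to var3 are those used above; using class as a grouping variable is merely an assumption for the sake of the example.

```stata
* Number of cases, mean, S.D. and quartiles for three variables;
* columns(statistics) arranges the statistics in columns, one row per variable
tabstat var1 var2 var3, statistics(n mean sd p25 p50 p75) columns(statistics)

* The same statistics for var3, separately for each category of class
tabstat var3, statistics(n mean sd p25 p50 p75) by(class)
```

The statistics are requested by name, so you see only those you asked for, in the order you listed them.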


Conditional description using tabulate

A related, but somewhat different possibility is to display summary statistics for a variable contingent on the values of another variable. Let's assume that variable "class" indicates the social class of the persons in your sample and "income" their income. Then

tabulate class, summarize(income)

will display the mean and the standard deviation of income, plus the number of observations, for each social class. As for the standard deviation, see my remark above on the n-1 denominator.

While there are a number of other ways to create an overview of means and S.D.s, a nice feature of tabulate is that it can display these statistics conditional on two variables combined. Thus,

tabulate class sex, summarize(income)

will display the summary statistics of income for each sex within each class. More than two variables, however, are not permitted.
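Two-way tables of this kind quickly become crowded. As a sketch of how the output can be trimmed, the following variants use the same variables as above together with options of tabulate that suppress parts of each cell or add the missing values.

```stata
* Means only, without standard deviations and frequencies
tabulate class sex, summarize(income) nostandard nofreq

* Treat missing values of class or sex as categories of their own
tabulate class sex, summarize(income) missing
```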


The table command

Yet another way is to use the table command. It can be used to display up to five statistics per cell, with cells defined by the categories of one or two variables. Thus,

table school gender, content(freq mean math p10 math p50 math p90 math)

will display for the schools in your sample (row variable), separately for boys and girls (column variable), the number of cases (freq), the mean math score, and the 10th, 50th and 90th percentile of the math score. Note

  • that the option content may be abbreviated by c,
  • that any integer between 1 and 99 can be used to obtain the respective percentile,
  • that other content is available, such as the interquartile range (iqr), minimum (min) and maximum (max), the sum (sum) or, if there are weights, the unweighted sum (rawsum), and finally the number of nonmissing observations (count or n).
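The notes above can be put together as follows. Please keep in mind that the syntax described here is that of Stata 16 and earlier; the table command was redesigned in Stata 17. The second command is my assumption of how the example from the text would be written under the newer syntax, so check help table in your version before relying on it. The two commands are alternatives for different Stata versions and are not meant to be run in the same session.

```stata
* Stata 16 and earlier: abbreviated option, five other statistics
table school gender, c(n math mean math iqr math min math max math)

* Stata 17 and later (assumed equivalent of the example in the text);
* the /// line continuation works in do-files only
table (school) (gender), statistic(frequency) statistic(mean math) ///
    statistic(p10 math) statistic(p50 math) statistic(p90 math)
```

As far as I know, the older syntax remains available in newer versions under version control, that is, by prefixing the command with version 16:.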

Hint: If you need a percentile that is not an integer, such as 2.5, you should try the _pctile command. This is a programmer's command, and hence the result must be requested from Stata with return list. Here's an example:

_pctile height, percentile(2.5 97.5)
return list

This will store the 2.5th and the 97.5th percentiles as the scalars r(r1) and r(r2), and upon submitting the second command, Stata will answer:

r(r1) = XX
r(r2) = XX

where XX will be replaced by the value Stata has computed, of course.
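Because the results are returned in r(), they can be used in subsequent commands. Here is a minimal sketch for a do-file; the macro names lo and hi are chosen for this illustration only.

```stata
_pctile height, percentiles(2.5 97.5)

* Copy the results into local macros before another command overwrites r()
local lo = r(r1)
local hi = r(r2)

display "Central 95 percent of height: " `lo' " to " `hi'

* Count the cases outside this range; missing values are excluded
* explicitly because Stata treats them as larger than any number
count if (height < `lo' | height > `hi') & !missing(height)
```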

© W. Ludwig-Mayerhofer, Stata Guide | Last update: 19 Aug 2013