Univariate Charts
This entry introduces some basic univariate charts. It presents an overview with the most basic commands. Be sure to read the other entries in this section, which contain important information about options for graphs. The final two entries are devoted to more complex graphs, where several elements are overlaid or where several graphs are combined in a single plot.
Bar charts
Bar charts are typically used to display the frequencies or the percentages of categories of a variable.
Procedure graph bar
By default, Stata uses bar charts to show the mean values of variables. However, when graph bar is used with the over() option and without naming a variable to summarize, bar charts for percentages (or frequencies, for that matter) of categories will be displayed.
graph bar, over(education)
will show the percentage of each category of variable "education". For frequencies, use
graph bar (count), over(education)
In some cases (many categories, long variable labels) a horizontal bar chart may be preferable. Use option horizontal.
As an aside, a drawback of the graph bar command is that the name of the variable is not displayed on the x axis. To display it, you have to use special options for labeling axes, to wit, b1title("Any title you want") for vertical bar charts or l1title("Any title you want") in the case of horizontal charts.
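The options above can be tried out quickly. The following is a minimal sketch that assumes a recent version of Stata and uses the auto dataset shipped with Stata (variable rep78) in place of "education"; the titles are arbitrary.

```stata
* Load an example dataset that ships with Stata (assumption: stands in for your own data)
sysuse auto, clear

* Percentage of cars in each category of rep78, with a title below the x axis
graph bar, over(rep78) b1title("Repair record 1978")

* Frequencies instead of percentages, drawn horizontally, with the title on the left
graph bar (count), over(rep78) horizontal l1title("Repair record 1978")
```

Each command draws a new graph that replaces the previous one in the Graph window.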
Procedure catplot
catplot
is an ado file which has to be downloaded. Use findit catplot
for downloading this file to your computer.
catplot education, percent recast(bar)
will achieve the same result as the graph bar
command outlined above. However, the name or the label of the variable displayed is used as a title for the x axis. Note that catplot can also be used to produce two-dimensional graphs.
Cumulative distribution functions
There are various methods to obtain cdfs of empirical variables; each has its merits and its drawbacks. I discovered Adrian Mander's cdfplot
only recently; it should do in most cases, but I have not yet been able to explore it in full detail.
Using cdfplot
Procedure cdfplot
is most apt for categorical (ordered) variables, but in principle, you may use it for any type of variable you like. It uses a step function to connect the values of the c.d.f.
First, install the ado file:
ssc install cdfplot
Then you may obtain a cdf plot:
cdfplot education
You may compare groups:
cdfplot education, by(sex)
Further possibilities (see help cdfplot
): You may include, by way of comparison, the c.d.f. of a normally distributed variable, and there are some further options.
Using distplot
distplot
is part of a package written by Nicholas J. Cox. Type findit distplot
, look for the most recent entry on "Software update for distplot" and follow the link provided. Note that, if I am not mistaken, the syntax was somewhat different in earlier versions; if what follows does not work in your case, you may be using such an earlier version (or a version that deviates in other ways from the one I am using).
distplot line income
With categorical variables, you may wish to display the distribution as a step function, which can be achieved as follows:
distplot line education, connect(stairstep)
By default, the cdf is computed and displayed in terms of fractions. To obtain percentages, you have to use:
distplot line education, connect(stairstep) trscale(100 * @)
trscale
is for "transform scale". The "at" sign (@) stands for the original values, i.e. the fractions; you may use other transformations as well.
distplot
may also be used to compare the cdf's of two or more groups as follows:
distplot line income, by(groupd) ytitle("cdf")
will display the cdf of income for the groups defined by variable "groupd" (of course, further options may be applied).
Histograms
histogram income
will display a histogram of, yes, "income".
You can obtain some control over your histogram with the following options:
histogram income, bin(20) start(0)
histogram income, width(500) start(0)
Option bin
refers to the number of, well, bins, and start
to the value at which the computation of the width of the bins will start (it will end automatically at or near the highest value encountered). As an alternative to the number of bins, you may indicate the width of each bin as in the second example.
Further options refer to the y-axis. By default, it will be drawn as density. You may also choose any of the following (the options are self-explanatory; the parts in brackets may be omitted):
histogram income, frac[tion]
histogram income, percent
histogram income, freq[uency]
Finally, you may add densities to the graph as follows:
histogram income, norm[al]
histogram income, kden[sity]
There are a few further options, particularly with respect to the display of the lines of the normal density or the kernel density estimate, which would lead us astray if explained at length here.
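The binning, y-axis and density options described above can be combined in a single command. The following is a minimal sketch that assumes the auto dataset shipped with Stata, with variable price standing in for "income"; the numbers of bins and the bin width are arbitrary choices.

```stata
* Load an example dataset that ships with Stata (assumption: stands in for your own data)
sysuse auto, clear

* 20 bins starting at 0, y axis in percent, normal density overlaid
histogram price, bin(20) start(0) percent normal

* Bins of width 500 starting at 0, y axis in frequencies, kernel density estimate overlaid
histogram price, width(500) start(0) frequency kdensity
```

Note that the value given in start() must not exceed the smallest value of the variable.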
Histograms for categorical (discrete) variables
Histograms may be used for categorical variables as well. The basic difference is that in this case values of the variable normally will not be grouped together; rather, they will be treated as separate categories, which should also influence the look of the graph. The main trick is the use of the discrete
and the gap()
options, as in:
histogram trust_pol, discrete gap(20) percent
gap
refers to the gap between the bars; as you may expect some trial and error may be required here depending on your data. Actually, the number in parentheses indicates that the width of the bars will be reduced by so-and-so-many per cent.
Whereas the start()
option has no effect here (as far as I can see), the width()
option may be used for grouping categories. In my view, however, this is against the 'logic' of discrete charts.
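For readers without a variable like "trust_pol" at hand, the following is a minimal sketch using the auto dataset shipped with Stata, where rep78 takes the integer values 1 to 5. The xlabel() option is an addition not discussed above; it merely places a tick at every category.

```stata
* Load an example dataset that ships with Stata (assumption: stands in for your own data)
sysuse auto, clear

* One bar per value of rep78, bar width reduced by 20 per cent, y axis in percent
histogram rep78, discrete gap(20) percent xlabel(1(1)5)
```

Changing the number in gap() and redrawing is the quickest way to find a bar width that suits your data.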
Kernel density estimators
kdensity trust
Only one variable can be specified. The look of the graph can be influenced by the "bwidth" option, as in:
kdensity trust, bwidth(.5)
The value to be chosen for the bandwidth, or smoothing parameter, depends on your data; you may have to play around a bit. Normally, the value chosen by Stata is a good starting point.
Another issue is the kernel to be used. By default, Stata uses the Epanechnikov kernel, which from a theoretical viewpoint can be shown to minimize the (mean integrated squared) error when using kdensity to estimate the underlying population distribution. Yet, sometimes other kernels may be used.
kdensity trust, kernel(biweight)
will use the biweight kernel. Some other well-known kernels that are available are epan2
(a simplified version of the Epanechnikov kernel), gaussian
or triangle
.
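To see what the bandwidth and the kernel do, it helps to start from the bandwidth Stata chooses and vary it. The following is a minimal sketch that assumes the auto dataset shipped with Stata (variable price in place of "trust"); it relies on kdensity leaving the bandwidth it used in r(bwidth). The macro names bw and half are arbitrary.

```stata
* Load an example dataset that ships with Stata (assumption: stands in for your own data)
sysuse auto, clear

* Default: Epanechnikov kernel, bandwidth chosen by Stata
kdensity price

* Keep the bandwidth Stata used and show it
local bw = r(bwidth)
display "default bandwidth: " `bw'

* Half the default bandwidth gives a less smooth curve
local half = `bw' / 2
kdensity price, bwidth(`half')

* Default bandwidth again, but with the Gaussian kernel
kdensity price, bwidth(`bw') kernel(gaussian)
```

Because local macros are used, the commands should be run together, e.g. from a do-file, rather than one by one.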
Box plots (box-and-whisker plots)
graph box income
will produce a box-and-whisker plot of variable "income".
graph box income1998 income2000 income2002 income2004, cw
will produce box plots of income in the sample over several years. Note the cw
, or casewise (deletion), option used here which causes Stata to use only cases with valid values for all variables.
Dot plots
graph dot income
will produce a dot showing the mean of variable "income".
The dot plot can be expanded or changed in various ways. You may list several variables, whose means will be shown along a single line. You may also request other values than the mean, as in, e.g.:
graph dot (p10) income (p90) income
which will produce a plot showing the 10th and the 90th percentile (i.e., the first and the ninth decile).
The dot plot is most useful for the comparison of two or more groups; see the respective entry on bivariate graphs.
© W. Ludwig-Mayerhofer, Stata Guide | Last update: 19 Mar 2019