Univariate Charts
This entry introduces some basic univariate charts, presenting an overview of the most basic commands. Only occasionally options to adapt or refine the graphs are mentioned, so be sure to read the other entries in the "Graphs" section which hopefully contain helpful information about options for graphs.
The contents of this entry are organized into subchapters as follows:
- Charts for discrete data (graph bar, catplot, tabplot and discrete histograms)
- Cleveland dot plots
- Stem-and-leaf diagrams and box plots: Exploratory data analysis graphs
- Strip plots, histograms, dot plots, kernel density and violin plots: Data and distributions
- Quantile plots / cumulative distribution plots
Bar charts for discrete data
Before starting, I want to direct your attention to Nicholas J. Cox' (2004a) paper on graphs for categorical variables. You'll find interesting discussions and helpful examples.
Bar charts are typically used to display the frequencies or the percentages of categories of a variable. The bars may be very slim (looking like spikes), even though this option is rarely used. Stata offers several ways to create bar charts, and these produce rather similar (though not identical) things if you are dealing with a single variable. Larger differences appear once you start dealing with two variables at the same time.
Procedure graph bar
By default, Stata uses bar charts to show the mean values of variables. These charts are quite ugly, and Cleveland dot plots should usually be preferred. However, when graph bar is used together with over(), bar charts for percentages (or frequencies, for that matter) of categories will be displayed.
graph bar, over(nrooms)
is the example I have used to show the distribution (in percentages) of the number of rooms of the respondents' flats (from a small data set simulated after real data). Frequencies instead of percentages can be obtained with graph bar (count)
.
I have used several options in creating this graph; here are two that are specific to bar charts:
, b1title("Number of rooms")
to display a description of the variable displayed,
, bar(1, bfcolor(white) blcolor(black))
to address the filling colour and the line colour (by default, the lines delineating the bars are in grey).
In addition, I'd like to mention option , gap(#) that influences the width of the bars (by influencing the gaps between them). This is a suboption for over(), i.e., it has to be placed after "over", within the parentheses around the variable name, as in over(nrooms, gap(#)). You'll have to experiment a little, but note that something like gap(5000) will transform the bars into spikes. (However, if you really want spikes, you may use spikeplot nrooms instead of graph bar in the first place.)
In some cases (many categories, long variable labels) a horizontal bar chart may be preferable. Use option , horizontal to get this type of bar chart and , l1title("...") for the description of the variable shown. Alternatively, use command graph hbar, omitting option , horizontal.
Further options are described in the other entries of the "Graphs" section, particularly those about titles, legends and notes or about axes.
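For readers who want to try the commands from this subsection without my data, here is a minimal sketch. It assumes the auto data set that ships with Stata; variable rep78 (a repair record with five categories) merely stands in for nrooms, and the titles are placeholders of my choosing.

```stata
* Sketch only: auto.dta and rep78 are stand-ins for the survey data used above
sysuse auto, clear

* Percentages of the categories (no variable is named before the comma)
graph bar, over(rep78) b1title("Repair record 1978")

* Frequencies, white bars with black outlines, wider gaps between the bars
graph bar (count), over(rep78, gap(80)) bar(1, fcolor(white) lcolor(black))

* Horizontal version with the variable description on the left
graph hbar (count), over(rep78) l1title("Repair record 1978")
```

Cases with a missing value on rep78 are simply left out of these charts.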
Procedure catplot
catplot
is an ado file which has to be downloaded. Use findit catplot
for downloading this file to your computer.
catplot nrooms, percent recast(bar)
will achieve the same result as the graph bar
command outlined above. However, the name or the label of the variable displayed is used as a title for the x axis. Note that catplot can also be used to produce two-dimensional graphs.
Procedure tabplot
Another procedure you have to install separately. Use search tabplot
and download the latest version. This procedure is of interest particularly for bivariate or trivariate relationships, but it has something to offer for univariate display as well.
The main difference to similar procedures is that there is no scale against which to measure the height (or the length) of the bars. Just invoking tabplot varname
will only produce bars of different height, which of course may be all you want (the relative height of the bars being what people will be most interested in). If you want more, i.e. numbers associated with the bars, you can have them displayed beneath the bars, as in the following:
tabplot nrooms, percent showval( , format(%2.0f)) xtitle(" " "Number of rooms") subtitle(" ") scheme(s1mono) plotregion(lstyle(none))
Let me highlight two options: First, showval( , format(%2.0f))
will display the percentages underneath the bars, with the format part suppressing the decimal places (you can omit the parentheses after showval to obtain the default, i.e. numbers with one decimal place). Second, the subtitle
option as used here suppresses the word "percent" which otherwise would appear on the top left of the chart. Of course, you may just as well choose a different subtitle by putting it between the double quotes, or omit "subtitle" entirely to display the default.
What if you want a different look of the bars? To achieve this is a bit tricky: You have to use option , sep(varname)
which tells Stata that each bar of the variable under consideration is to be treated as a separate entity. Now, you may add option barall()
to change the look of all the bars (which is what you will normally want to do with a univariate diagram). For example, barall(bcolor(white) blcolor(black))
will yield bars like those in the graph bar
example further above. If you wish, you may address each single bar by bar1()
, bar2()
... up to bar20()
. (In case you want to change the look of all the bars and at the same time highlight a particular bar, you might first use a barall() clause followed by bar#(), with # replaced by the number of the particular bar. For instance, bar6(bcolor(red)) would show the sixth bar in red.)
Histograms for categorical (discrete) variables
Histograms are typically used for metric variables (see further below), but a command such as
histogram nrooms, discrete gap(20) percent
will create a chart that is quite similar to a bar chart.
gap
refers to the gap between the bars; as you may expect some trial and error may be required here depending on your data. Actually, the number in parentheses indicates that the width of the bars will be reduced by so-and-so-many per cent.
While the start()
option (described below) has no effect here as far as I can see, the width()
option may be used for grouping categories. In my view, however, this is against the 'logic' of discrete charts.
Cleveland dot plots
A Cleveland dot plot looks like a bivariate plot, and is sometimes used as such, but in its typical usage it is better considered (in my opinion) as a univariate plot where the different data values, represented by dots, carry identifying labels. A very different type of dot plot is shown in a later subsection.
Note that the terminology concerning dot plots or dot charts is not firmly established, occasional claims to the contrary notwithstanding. Both the plots in this subsection as well as several other types of plot can be found under the headings of "dot plots" or "dot charts". However, the term "Cleveland dot plot" (after statistician and data visualisation expert William S. Cleveland) has been coined to highlight this particular type of graph, and I follow this usage. The Stata User's Guide does not use this term; the examples shown there do not refer to the display of single data values but rather of summary statistics such as means or percentages over groups formed by another variable.
Here is an example of the basic Cleveland dot plot (with single data values). I use data from the OECD on labour force participation of women aged 25 to 49 in 2007 (the data can be found in an introduction to Statistics I wrote in German with two colleagues, Ludwig-Mayerhofer et al. 2014, on p. 51). In this example, the data are ordered from the highest to the lowest by way of suboption (to option "over") , sort(1) descending
(with a country with a missing value on the bottom), but they might also be ordered the other way round, or alphabetically.
graph dot lfp07_age2549, over(country, sort(1) descending label(labsize(*.8))) scheme(s1color) plotregion(lstyle(none)) ytitle(" " "Female labour force participation") ylabel(55 (5) 80) exclude0
Option exclude0
ensures, together with ylabel(55 (5) 80)
that what Stata calls the y axis starts at 55. Calling this the y axis is entirely appropriate, as technically speaking what we see is a bivariate plot. And of course we may use this graph to think about which "country factors" might actually influence female labour force participation (we see that the Scandinavian countries are on the top, and the Southern European countries [and Japan] on the bottom). Still, I would say that primarily this is a(n ordered) list of data values, together with labels.
Note that graph dot
actually computes and displays the mean of the variable under investigation (labour force participation) for each country. But in this example, countries are cases; i.e., there is only a single row of data for each country, and therefore the "mean" is identical to that single value. Of course, the same graph might have been obtained from a data set with appropriately coded individual level data (i.e. data with several hundred or thousands of cases from each country).
graph dot
can also be used to display several variables; here is an example with just two variables from a survey, the number of people in the respondents' households and the number of rooms of the apartments the respondents were living in:
graph dot hhsize nrooms, ascategory scheme(s1mono) plotregion(lstyle(none)) ylabel(,angle(0))
where option ascategory
ensures that the two values are shown on two different lines.
A further possibility is to display several values from the distribution of a variable. I combine this with two grouping variables, education and sex/gender, and request the median together with the 10th and 90th percentiles (i.e., the first and the ninth decile) of equivalent income for each group.
graph dot (p10) equivinc (p50) equivinc (p90) equivinc , over(edu) over(sex) scheme(s1color) plotregion(lstyle(none)) legend( label (1 "p10") label (2 "Median") label (3 "p90") )
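The two main uses of graph dot described in this subsection can be reproduced roughly as follows. This is a sketch under the assumption that the auto data set shipped with Stata is an acceptable stand-in: each car model (make) appears in exactly one row, so the "mean" plotted is the single value, just as with the countries above. The axis range and the legend texts are my own choices.

```stata
* Sketch only: makes of foreign cars play the role of the countries above
sysuse auto, clear

* Labelled single values, ordered from highest to lowest
graph dot mpg if foreign == 1, over(make, sort(1) descending label(labsize(*.8))) ylabel(10(5)45) exclude0 ytitle("Mileage (mpg)")

* Several statistics of one variable for each group
graph dot (p10) mpg (p50) mpg (p90) mpg, over(foreign) legend(label(1 "p10") label(2 "Median") label(3 "p90"))
```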
Exploratory data analysis graphs
This is a very brief section for the sake of completeness. More on these graphs can be found in the EDA entry, links to which are given below.
Stem-and-leaf diagrams
Stem-and-leaf diagrams are an important tool of exploratory data analysis (EDA), and should be considered seriously by any data analyst. They are outlined in the EDA section I have just referred to. The basic command to obtain such a plot is simply
stem income
Box plots (box-and-whisker plots)
graph box income
will produce a box-and-whisker plot of variable "income".
You may list more than one variable, if desired (and appropriate). However, the box-and-whisker plot is mostly used for comparing groups, as is shown in the entry about bivariate graphs.
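If you have no income variable at hand, the same commands can be tried on the auto data set that comes with Stata. The sketch below assumes that mpg is a reasonable stand-in for "income"; the last line only hints at the group comparison that is covered in the entry about bivariate graphs.

```stata
* Sketch only: mpg from auto.dta replaces "income"
sysuse auto, clear

* Stem-and-leaf display (text output in the Results window)
stem mpg

* Box-and-whisker plot of a single variable
graph box mpg

* The more typical use: one box per group
graph box mpg, over(foreign)
```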
Data and distributions (further plots used mostly for metric variables)
Many graphs have been developed that attempt to give an approximate view of the distribution of a variable (the box-and-whisker plot actually should be counted among these). I start with a plot that tries to show all single data values and proceed to plots that present more abstracted views.
Strip plots
Strip plots (or strip charts) display each single data value as a dot. However, in contrast to dot plots, the dots are not stacked to resemble a histogram; they are (typically) placed all beside (or above/beneath) each other, on a single "strip" as it were.
The strip plot ado file must be downloaded to your system if you haven't done so before:
ssc install stripplot
The following plot was created by
stripplot equivinc, vertical ytitle("Household equivalent income") ylabel(,angle(0)) scheme(s1mono) plotregion(lstyle(none) )
Note that command stripplot can be used to produce dot plots as well. For instance, the following command creates a dot plot that is analogous to the one shown further below:
stripplot equivinc, stack width(500) vertical ytitle("Household equivalent income") ylabel(,angle(0)) scheme(s1mono) plotregion(lstyle(none))
Histograms
Histograms are common, but they are also treacherous (as will be demonstrated shortly). Most professional statisticians prefer other graphs in most circumstances, but sometimes a histogram can be useful.
histogram equivinc, percent fcolor(gs12) lcolor(white) ylabel(,angle(0))
(plus some further options) was used to create a histogram of the distribution of household equivalent incomes in Germany (older data in Deutschmark; only an approximation of real data):
However, a histogram of the same data could also look like this:
The first histogram used Stata's defaults, the second one was based on indications of the intervals into which values are grouped, plus a starting point for the first interval, which can be obtained as follows:
histogram equivinc, percent width(500) start(0)
As an alternative to width(), you may use bin(), indicating the number of, well, bins, i.e., intervals or groups. Be that as it may, the "binning" of data points into groups defined by (more or less) arbitrarily chosen intervals is a drawback of the histogram.
Note that instead of percentages, you might wish to show probability density values (the default; i.e., you'll just drop , percent
), fractions (use option , frac[tion]
) or frequencies (, freq[uency]
). This will only change the values on the vertical axis, not the shape of the histogram.
Finally, you may wish to use an option to add a density curve to the graph, in particular , norm[al] or , kden[sity]. There are a few further options, particularly with respect to the display of the lines of the normal density or the kernel density estimate, which would lead us astray if explained at length here.
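To see for yourself how much the look of a histogram depends on the binning, you might run something like the following. It is a sketch that assumes the auto data set shipped with Stata, with price standing in for equivinc; the bin width of 1,000 and the starting point of 3,000 are arbitrary values that happen to suit this variable.

```stata
* Sketch only: price from auto.dta replaces equivinc
sysuse auto, clear

* Stata's default choice of bins
histogram price, percent

* Same data, bins of width 1,000 starting at 3,000
* (start() must not exceed the smallest observed value)
histogram price, percent width(1000) start(3000)

* Ten bins, frequencies on the vertical axis, normal density added
histogram price, bin(10) frequency normal

* Default density scale with a kernel density estimate added
histogram price, kdensity
```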
A little known alternative is a histogram whose bars represent chunks of the data with equal probability, which of course will result in bars of different width in virtually all (real) cases. Procedure eqprhistogram
can be downloaded with the command
ssc install eqprhistogram
and a command like the following can be used to create the histogram:
eqprhistogram equivinc, bin(10)
Option bin(#)
is of course not required, but it is useful here to determine the proportion of data that are represented by the bars.
Cox (2004b) discusses some additional varieties of the histogram.
Dot plots
"Dot plot" can mean several things, but in statistics it's most often used to denote a sort of histogram that uses, well, (stacked) dots instead of bars. The version implemented in Stata displays the (grouped, or "binned") values of the variable under investigation on the vertical axis and the frequencies on the horizontal axis.
Graphs of this type are created by the dotplot
command. In contrast, graph dot
creates what occasionally is referred to as Cleveland dot plots, described above.
As an aside, note that not everybody adheres to this conceptual distinction. For instance, a very good text on Cleveland dot plots (Jacoby 2006) uses only the simpler term "dot plot". Another text (Wilkinson 1999) about "dot plots" refers to the kind of plot described here (however, Wilkinson argues that the "real" dot plot is one that foregoes binning and suggests that dot plots that group the data be called "histodot plots").
The following example shows a dot plot of household equivalent income of 100 cases (data are simulated after real data); data were grouped into 20 bins (some of which are empty). Think about it as a histogram that is turned clockwise by 90 degrees. Note, however, that the lower values are at the bottom; that is, if you could turn the graph counter-clockwise (which unfortunately you cannot do in Stata, as far as I know) the histogram would start with the highest values on the left hand side.
The command to create this plot is:
dotplot equivinc, ny(20) scheme(s1mono) plotregion(lstyle(none)) ytitle("Household equivalent income") ylabel(,angle(0))
ny(20)
tells Stata to build 20 groups or "bins" from the variable under investigation (the default is 35). The remaining options change the look of the graph.
Another interesting option could be center
which centers the dots instead of aligning them on the left.
Note: As mentioned before, command stripplot may also be used to produce plots like the one shown here.
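The effect of ny() and center is easiest to grasp by comparison. The following sketch assumes the auto data set shipped with Stata, with mpg in place of equivinc.

```stata
* Sketch only: mpg from auto.dta replaces equivinc
sysuse auto, clear

* 20 bins, dots aligned on the left
dotplot mpg, ny(20)

* Same bins, dots centered
dotplot mpg, ny(20) center
```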
Kernel density estimators
Only one variable can be specified (even though formally Stata defines this as a twoway graph). The density estimated and displayed may be influenced by the "bwidth" option, as in:
kdensity equivinc, bwidth(500)
The value to be chosen for the bandwidth, or smoothing parameter, depends on your data; you may have to play around a bit. Normally, the value chosen by Stata is a good starting point (actually it's quite near to 500 in my example).
Another issue is the kernel to be used. By default, Stata uses the Epanechnikov kernel which from a theoretical viewpoint can be shown to minimize the error when using kdensity to estimate the underlying population distribution. Yet, sometimes other kernels may be used.
kdensity equivinc, kernel(biweight)
will use the biweight kernel. Some other well-known kernels that are available are epan2
(a simplified version of the Epanechnikov kernel), gaussian
or triangle
.
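Playing around with the bandwidth might look as follows. This is only a sketch: it assumes the auto data set shipped with Stata (price instead of equivinc), and the bandwidth values are arbitrary numbers chosen to make the difference visible, not recommendations.

```stata
* Sketch only: price from auto.dta replaces equivinc
sysuse auto, clear

* Stata's own choice of bandwidth; the value used is left behind in r(bwidth)
kdensity price
display r(bwidth)

* A bandwidth and a kernel chosen by hand
kdensity price, bwidth(800) kernel(gaussian)

* Under- and oversmoothing side by side (illustrative values)
twoway (kdensity price, bwidth(300)) (kdensity price, bwidth(1500)), legend(order(1 "bwidth(300)" 2 "bwidth(1500)"))
```

Starting from the value shown by display r(bwidth) and moving up or down from there is usually quicker than guessing.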
Violin plots
A violin plot combines a box-and-whisker plot with a density estimator (for details see Hintze & Nelson 1998).
The violin plot is not part of the package you obtain from Stata Corp.; an ado file must be downloaded to your system:
ssc install vioplot
The following plot was created by
vioplot equivinc
plus some options.
Quantile plots / cumulative distribution plots for metric or ordinal data
On this topic, see also a presentation by Nicholas J. Cox, who is generally a great source for (and of) graphs in Stata.
Stata's quantile plot commands can be found under the heading of "diagnostic plots" (typing help diagnostic plots
will show the way). Several ado files offer alternative routes to established goals or additional solutions.
Stata's diagnostic plots
The following offers only a few very common charts.
Quantile plots
Here, the ordered values of the variable under investigation are shown, plus a reference line that corresponds to a uniformly distributed variable.
The following plot was created by
quantile equivinc
It can be seen that, for instance, the first quartile is in the region of 1,500, and the median somewhere near 2,300. From this, it follows that the distribution is very skewed; whereas about one half of the data is between 0 (actually the lowest value is a bit higher) and 2,300, the other half is between 2,300 and over 6,000 (or 8,000 if we include the outlier). It can also be seen that the data are far from being uniformly distributed. (Actually, few data are.) Note that a "normal probability plot", created by pnorm equivinc
, compares the empirical distribution against a normal distribution.
Quantile-quantile plots
These plots graph quantiles of one variable against those of another variable. If the variables have the same distribution, the quantiles correspond to each other; the plot forms a straight line (this reference line is shown in the graph).
The following plot comparing (net) incomes of male and female full time workers (in Deutsche Mark; data are from the year 2000) was created by
qqplot incf incm
(plus some options). We see that incomes of male workers are considerably higher than those of female workers; e.g., the cumulative fraction that corresponds to 6,000 DM in the case of men has an income of only 4,000 DM in the case of women. (Note that while there is a gender pay gap in Germany, it is typically even larger in net incomes because of the specifics of German tax law.)
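The diagnostic plots of this subsection can be tried out on the auto data set shipped with Stata. Note that qqplot expects two variables, so in data where the groups are defined by a third variable (here: foreign) you first have to split the variable of interest; the names weightd and weightf are just what I chose for this sketch.

```stata
* Sketch only: auto.dta replaces the income data used above
sysuse auto, clear

* Ordered values against the quantiles of a uniform distribution
quantile price

* Standardized normal probability plot
pnorm price

* Quantile-quantile plot of two groups: one variable per group is needed
generate weightd = weight if foreign == 0
generate weightf = weight if foreign == 1
qqplot weightd weightf
```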
Nicholas J. Cox' qplot
This is an alternative to quantile
, without the latter's reference line. To install it, you have to search qplot
and look which of the links presented offers the latest update. The procedure has several additional features which are easily accessible via help qplot
once the ado file is installed.
Using cdfplot and distplot
The commands discussed thus far, quantile
and qplot
, show the values of the variable under investigation on the vertical axis. Cumulative distribution functions (c.d.f.s), however, are often displayed the other way round: the values on the horizontal axis and the associated cumulative probabilities on the vertical axis. Such plots can be obtained with cdfplot
(written by Adrian Mander) or distplot
, again a result of Nicholas J. Cox' seemingly inexhaustible productivity.
cdfplot
Procedure cdfplot
is most apt for categorical (ordered) variables, but in principle, you may use it for any type of variable you like. It uses a step function to connect the values of the c.d.f. After installing the ado file with
ssc install cdfplot
you may obtain a cdf plot:
cdfplot education
You may compare groups, as in
cdfplot education, by(sex)
Further possibilities (see help cdfplot
): You may include, by way of comparison, the c.d.f. of a normally distributed variable, and there are some further options.
distplot
Another package that is not part of Stata's standard distribution. Type findit distplot
, look for the most recent entry on "Software update for distplot" and follow the link provided. Note that the syntax of distplot
has changed over time; what I describe here is working in 2025 with Stata version 16. The simplest version of the command is as follows:
distplot income
With discrete (ordered) variables, you may wish to display the distribution as a step function, which can be achieved as follows (the resulting graph will look very much like the one created by cdfplot
):
distplot education, connect(stairstep)
The option to change the style of the connecting line is a specific feature of distplot
(stepstair
is a further style available, but it is not recommended in the context of c.d.f.s).
By default, the c.d.f. is computed and displayed in terms of fractions. To obtain percentages, write:
distplot education, connect(stairstep) trscale(100 * @)
trscale
obviously is for "transform scale". The "at" sign stands for the original values, i.e. the fractions; other transformations may be used.
distplot
may also be used to compare the c.d.f.s of two or more groups as follows (note that (stairstep ..)
is a shortcut for writing out stairstep three times [one for each line]):
distplot equivinc, over(bildgr) connect(stairstep ..) ylabel(,angle(0)) scheme(s1color) plotregion(lstyle(none)) xt(" " "Education")
distplot
has a few features that could be worth investigating; for more information, see help distplot.
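If you cannot (or do not wish to) install user-written packages, a similar plot can be put together with official Stata commands. The following sketch is not how cdfplot or distplot work internally; it simply uses cumul to compute the empirical c.d.f. and draws it as a step function. It assumes the auto data set shipped with Stata, and the variable names cdf_price and cdf_pct are my inventions.

```stata
* Sketch only: official Stata commands instead of cdfplot/distplot
sysuse auto, clear

* Empirical c.d.f. of price, stored as fractions in a new variable
cumul price, generate(cdf_price)

* Values on the horizontal axis, cumulative fractions on the vertical axis
line cdf_price price, sort connect(stairstep) ytitle("Cumulative fraction")

* The same in percentages
generate cdf_pct = 100 * cdf_price
line cdf_pct price, sort connect(stairstep) ytitle("Cumulative per cent")
```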
References
- Cox, Nicholas J. (2004a): Speaking Stata: Graphing categorical and compositional data, The Stata Journal, 4 (2), pp. 190–215.
- Cox, Nicholas J. (2004b): Speaking Stata: Graphing distributions, The Stata Journal, 4 (1), pp. 66–88.
- Hintze, Jerry L./Nelson, Ray D. (1998): Violin Plots: A Box Plot-Density Trace Synergism, The American Statistician, 52 (2), pp. 181–184, DOI: 10.1080/00031305.1998.10480559.
- Jacoby, William G. (2006): The Dot Plot: A Graphical Display for Labeled Quantitative Values, The Political Methodologist. Newsletter of the Political Methodology Section, American Political Science Association, 14 (1), pp. 6–14.
- Ludwig-Mayerhofer, Wolfgang/Liebeskind, Uta/Geißler, Ferdinand (2014): Statistik. Eine Einführung für Sozialwissenschaftler, Weinheim, Basel: Beltz Juventa.
- Wilk, M. B./R. Gnanadesikan (1968): Probability plotting methods for the analysis of data, Biometrika 55, pp. 1–17.
© W. Ludwig-Mayerhofer, Stata Guide | Last update: 17 Apr 2025