Data and Distributions
The graphs in this entry are meant mainly for the display of metric variables.
Many graphs have been developed that attempt to give an approximate view of the distribution of a variable (the box-and-whisker plot, shown elsewhere, actually should be counted among these). I start with a plot that tries to show all single data values and proceed to plots that present more abstracted views.
Please note: Most graphs in this entry have been created by using, among other options, the following: ylabel(,angle(0)), scheme(s1mono) and plotregion(lstyle(none)). These are not repeated in the examples shown below.
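The data set behind the examples is not part of this entry. For readers who want to try the commands, here is a minimal sketch that simulates some stand-in data and then shows one command with the common options written out in full. The seed, the number of cases, the log-normal parameters and the three education groups are assumptions chosen for illustration only; they are not the author's data.

```stata
* Simulated stand-in data (all values below are illustrative assumptions)
clear
set seed 12345
set obs 100

* Right-skewed income variable, roughly log-normal
generate equivinc = exp(rnormal(7.5, 0.5))

* Hypothetical grouping variable with three categories
generate education = ceil(3*runiform())
label define edlbl 1 "Low" 2 "Medium" 3 "High"
label values education edlbl

* A graph command with the common look options spelled out
histogram equivinc, percent ylabel(,angle(0)) scheme(s1mono) ///
    plotregion(lstyle(none))
```

If you prefer not to repeat scheme(s1mono) in every command, set scheme s1mono makes it the scheme for the rest of the session.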
Strip plots
Strip plots (or strip charts) display each single data value as a dot. However, in contrast to dot plots, the dots are not stacked to resemble a histogram; they are (typically) all placed beside (or above/beneath) each other, on a single "strip" as it were.
The strip plot ado file must be downloaded to your system if you haven't done so before:
ssc install stripplot
The following plot was created by
stripplot equivinc, vertical ytitle("Household equivalent income")
Bivariate strip plots
A bivariate strip plot can be created as follows, e.g.:
stripplot equivinc, over(education) vertical ytitle("Household equivalent income") xtitle(" " "Education") aspectratio(1.3)
Histograms
Histograms are common, but they are also treacherous (as will be demonstrated shortly). Most professional statisticians prefer other graphs in most circumstances, but sometimes a histogram can be useful.
histogram equivinc, percent fcolor(gs12) lcolor(white) ylabel(,angle(0))
(plus some further options) was used to plot the distribution of household equivalent incomes in Germany (older data in Deutschmark; only an approximation of real data):
However, a histogram of the same data could also look like this:
The first histogram used Stata's defaults. The second one specified the width of the intervals into which values are grouped, plus a starting point for the first interval, as follows:
histogram equivinc, percent width(500) start(0)
As an alternative to width( ), you may use bin( ), indicating the number of, well, bins, i.e., intervals or groups. Be that as it may, the "binning" of data points into groups defined by (more or less) arbitrarily chosen intervals is a drawback of the histogram.
Note that instead of percentages, you might wish to show probability density values (the default; i.e., you simply drop percent), fractions (use option frac[tion]) or frequencies (freq[uency]). This will only change the values displayed on the vertical axis, not the shape of the histogram.
Finally, you can use an option to add a density curve to the graph, in particular norm[al] or kden[sity].
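To make the options just described concrete, here is a short sketch that combines them with the interval settings used above. It assumes the same variable equivinc with non-negative values, so that start(0) is valid.

```stata
* Frequencies on the vertical axis, fixed intervals, normal curve overlaid
histogram equivinc, frequency width(500) start(0) normal

* Same intervals, percentages, with a kernel density estimate instead
histogram equivinc, percent width(500) start(0) kdensity

* Let Stata derive the interval width from a given number of bins
histogram equivinc, fraction bin(20)
```

Note that width( ) and bin( ) are alternatives; specify one or the other, not both.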
A little-known alternative is a histogram whose bars represent chunks of the data with equal probability, which of course will result in bars of different width in virtually all (real) cases. Procedure eqprhistogram can be downloaded with the command
ssc install eqprhistogram
and a command like the following can be used to create the histogram:
eqprhistogram equivinc, bin(10)
Option bin(#) is of course not required, but it is useful here to determine the proportion of data that are represented by the bars.
Cox (2004) discusses some additional varieties of the histogram.
Dot plots
"Dot plot" can mean several things, but in statistics it's most often used to denote a sort of histogram that uses, well, (stacked) dots instead of bars. The version implemented in Stata displays the (grouped, or "binned") values of the variable under investigation on the vertical axis and the frequencies on the horizontal axis.
Graphs of this type are created by the dotplot command. In contrast, graph dot creates what occasionally is referred to as Cleveland dot plots, described in a different entry.
As an aside, note that not everybody adheres to this conceptual distinction. For instance, a very good text on Cleveland dot plots (Jacoby 2006) uses only the simpler term "dot plot". Another text (Wilkinson 1999) about "dot plots" refers to the kind of plot described here (however, Wilkinson argues that the "real" dot plot is one that foregoes binning and suggests that dot plots that group the data be called "histodot plots").
The following example shows a dotplot of household equivalent income of 100 cases (data are simulated after real data); data were grouped into 20 bins (some of which are empty). Think about it as a histogram that is turned clockwise by 90 degrees. Note, however, that the lower values are at the bottom; that is, if you could turn the graph counter-clockwise (which unfortunately you cannot do in Stata, as far as I know) the histogram would start with the highest values on the left-hand side.
The command to create this plot is:
dotplot equivinc, ny(20) ytitle("Household equivalent income")
ny(20) tells Stata to build 20 groups or "bins" from the variable under investigation (the default is 35). The remaining options change the look of the graph.
Another interesting option could be center, which centers the dots instead of aligning them on the left.
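As a brief illustration, the dot plot from above with centered dots could be requested as follows (a sketch that reuses the same variable and number of bins):

```stata
* Same dot plot as before, but with each row of dots centered
dotplot equivinc, ny(20) center ytitle("Household equivalent income")
```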
Note that the stripplot command can be used to produce dot plots as well. For instance, the following command creates a dot plot that is analogous to the one just shown:
stripplot equivinc, stack width(500) vertical
Bivariate dot plots
Like most graphs, the dot plot may be used to compare several groups, as in:
dotplot equivinc, ny(20) over(education) xtitle(" " "Education")
Kernel density estimators
The kdensity command estimates and plots a kernel density. Only one variable can be specified (even though formally Stata defines this as a twoway graph). The estimated density can be influenced by the bwidth option, as in:
kdensity equivinc, bwidth(500)
The value to be chosen for the bandwidth, or smoothing parameter, depends on your data; you may have to play around a bit. Normally, the value chosen by Stata is a good starting point (actually it's quite near to 500 in my example).
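One way to "play around" is to look up the bandwidth Stata chose and then overlay estimates for a few alternatives. The following is a sketch only; the three bandwidth values are arbitrary choices for income data on the scale used here, not recommendations.

```stata
* Run kdensity with its defaults, then display the bandwidth Stata chose
kdensity equivinc
display r(bwidth)

* Overlay estimates with a narrow, a medium and a wide bandwidth
twoway (kdensity equivinc, bwidth(250))  ///
       (kdensity equivinc, bwidth(500))  ///
       (kdensity equivinc, bwidth(1000)), ///
       legend(order(1 "bwidth 250" 2 "bwidth 500" 3 "bwidth 1000")) ///
       ytitle("Density") xtitle("Household equivalent income")
```

A narrow bandwidth follows the data closely and tends to produce a bumpy curve; a wide one smooths away detail.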
Another issue is the kernel to be used. By default, Stata uses the Epanechnikov kernel, which can be shown to be the most efficient in minimizing the mean integrated squared error of the estimate. Still, other kernels are sometimes used.
kdensity equivinc, kernel(biweight)
will use the biweight kernel. Some other well-known kernels that are available are epan2 (a simplified version of the Epanechnikov kernel), gaussian or triangle.
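To see how much (or how little) the choice of kernel matters for your data, the estimates can be overlaid in one graph. This is a sketch using twoway kdensity; each estimate uses the bandwidth Stata chooses by default.

```stata
* Compare three kernels in a single graph
twoway (kdensity equivinc, kernel(epanechnikov)) ///
       (kdensity equivinc, kernel(gaussian))     ///
       (kdensity equivinc, kernel(triangle)),    ///
       legend(order(1 "Epanechnikov" 2 "Gaussian" 3 "Triangle")) ///
       ytitle("Density") xtitle("Household equivalent income")
```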
Violin plots
A violin plot combines a box-and-whisker plot with a density estimator (for details see Hintze & Nelson 1998).
The violin plot is not part of the package you obtain from StataCorp; an ado file must be downloaded to your system:
ssc install vioplot
The following plot was created by
vioplot equivinc
plus some options.
Bivariate violin plots
Like box plots, violin plots can be important for comparing groups, as in this example:
vioplot equivinc, over(education) ytitle("Household equivalent income") xtitle(" " "Education")
References
- Cox, Nicholas J. (2004): Speaking Stata: Graphing distributions, The Stata Journal, 4 (1), pp. 66–88.
- Hintze, Jerry L./Nelson, Ray D. (1998): Violin Plots: A Box Plot-Density Trace Synergism, The American Statistician, 52 (2), pp. 181–184, DOI: 10.1080/00031305.1998.10480559.
- Jacoby, William G. (2006): The Dot Plot: A Graphical Display for Labeled Quantitative Values, The Political Methodologist. Newsletter of the Political Methodology Section, American Political Science Association, Vol. 14, Number 1, pp. 6-14.
- Wilkinson, Leland (1999): Dot plots, The American Statistician, 53 (3), pp. 276–281.
© W. Ludwig-Mayerhofer, Stata Guide | Last update: 20 Apr 2025