Data and Distributions

The graphs in this entry are meant mainly for the display of metric variables.

Many graphs have been developed that attempt to give an approximate view of the distribution of a variable (the box-and-whisker plot, shown elsewhere, actually should be counted among these). I start with a plot that tries to show all single data values and proceed to plots that present more abstracted views.

Please note: Most graphs in this entry have been created by using, among other options, the following: ylabel(,angle(0)), scheme(s1mono) and plotregion(lstyle(none)). These are not repeated in the examples shown below.
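
To make this note concrete, here is a minimal sketch of a complete command with the three options spelled out. It assumes the auto dataset that ships with Stata, with price standing in for the income variable used in this entry; it is not the exact command behind the figures.

```stata
* Stand-in data (assumption): Stata's shipped auto dataset, variable price
sysuse auto, clear

* Horizontal y-axis labels, monochrome scheme, no frame around the plot region
histogram price, percent ///
    ylabel(, angle(0)) scheme(s1mono) plotregion(lstyle(none))
```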

Strip plots

Strip plots (or strip charts) display each single data value as a dot. However, in contrast to dot plots, the dots are not stacked to resemble a histogram; they are (typically) all placed beside (or above/beneath) each other, on a single "strip" as it were.

The strip plot ado file must be downloaded to your system if you haven't done so before:

ssc install stripplot

The following plot was created by

stripplot equivinc, vertical ytitle("Household equivalent income")

Strip plot

Bivariate strip plots

A bivariate strip plot can be created as follows, e.g.:

stripplot aeqeink, over(bildgr) vertical ytitle("Household equivalent income") xtitle(" " "Education") aspectratio(1.3)

Strip plot

Histograms

Histograms are common, but they are also treacherous (as will be demonstrated shortly). Most professional statisticians prefer other graphs in most circumstances, but sometimes a histogram can be useful.

histogram equivinc, percent fcolor(gs12) lcolor(white) ylabel(,angle(0))

(plus some further options) was used to create a distribution of household equivalent incomes in Germany (older data in Deutschmark; only an approximation of real data):

Histogram

However, a histogram of the same data could also look like this:

A histogram of the same data, with different look

The first histogram used Stata's defaults. The second one specifies the width of the intervals into which values are grouped, plus a starting point for the first interval, as follows:

histogram equivinc, percent width(500) start(0)

As an alternative to width( ), you may use bin( ), indicating the number of, well, bins, i.e., intervals or groups. Be that as it may, the "binning" of data points into groups defined by (more or less) arbitrarily chosen intervals is a drawback of the histogram.
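
To see how much the picture depends on the binning, you can draw the same variable several times and put the results side by side. The following is only a sketch: it assumes Stata's shipped auto dataset (price stands in for the income variable), and the graph names bins5, bins20 and w1000 are made up for this illustration.

```stata
* Stand-in data (assumption): Stata's shipped auto dataset, variable price
sysuse auto, clear

* Same variable, three different binnings
histogram price, percent bin(5)                  name(bins5, replace)
histogram price, percent bin(20)                 name(bins20, replace)
histogram price, percent width(1000) start(3000) name(w1000, replace)

* Show the three histograms in one figure
graph combine bins5 bins20 w1000
```

Note that start( ) must not exceed the smallest value of the variable, which is why 3000 was chosen for these data.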

Note that instead of percentages, you might wish to show probability density values (the default; i.e., you'll just drop , percent), fractions (use option , frac[tion]) or frequencies (, freq[uency]). This will only change the values displayed on the vertical axis, not the shape of the histogram.
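
A quick way to convince yourself that only the vertical axis changes is to draw the variants next to each other. Again a sketch under assumptions: the auto dataset shipped with Stata serves as stand-in data, and the graph names are hypothetical.

```stata
* Stand-in data (assumption): Stata's shipped auto dataset, variable price
sysuse auto, clear

histogram price,           name(h_density, replace)
histogram price, fraction  name(h_fraction, replace)
histogram price, frequency name(h_frequency, replace)

* Same bars three times; only the y-axis scale differs
graph combine h_density h_fraction h_frequency
```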

Finally, you can use an option to add a density to the graph, in particular , norm[al] or , kden[sity].
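
Both overlays can be requested at once, which makes it easy to judge how far the data depart from a normal distribution. A minimal sketch, assuming Stata's shipped auto dataset as stand-in data:

```stata
* Stand-in data (assumption): Stata's shipped auto dataset, variable price
sysuse auto, clear

* Histogram with a normal curve and a kernel density estimate on top
histogram price, normal kdensity
```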

A little-known alternative is a histogram whose bars represent chunks of the data with equal probability, which of course will result in bars of different width in virtually all (real) cases. Procedure eqprhistogram can be downloaded with the command

ssc install eqprhistogram

and a command like the following can be used to create the histogram:

eqprhistogram equivinc, bin(10)

Equal-probability histogram of the same data

Option bin(#) is of course not required, but it is useful here because it fixes the proportion of the data represented by each bar: with bin(10), every bar covers one tenth of the observations.

Cox (2004) discusses some additional varieties of the histogram.

Dot plots

"Dot plot" can mean several things, but in statistics it's most often used to denote sort of a histogram that uses, well, (stacked) dots instead of bars. The version implemented in Stata displays the (grouped, or "binned") values of the variable under investigation on the vertical axis and the frequencies on the horizontal axis.

Graphs of this type are created by the dotplot command. In contrast, graph dot creates what occasionally is referred to as Cleveland dot plots, described in a different entry.

As an aside, note that not everybody adheres to this conceptual distinction. For instance, a very good text on Cleveland dot plots (Jacoby 2006) uses only the simpler term "dot plot". Another text (Wilkinson 1999) about "dot plots" refers to the kind of plot described here (however, Wilkinson argues that the "real" dot plot is one that forgoes binning and suggests that dot plots that group the data be called "histodot plots").

The following example shows a dot plot of household equivalent income of 100 cases (data are simulated after real data); data were grouped into 20 bins (some of which are empty). Think about it as a histogram that is turned clockwise by 90 degrees. Note, however, that the lower values are at the bottom; that is, if you could turn the graph counter-clockwise (which unfortunately you cannot do in Stata, as far as I know) the histogram would start with the highest values on the left-hand side.

Univariate dot plot

The command to create this plot is:

dotplot equivinc, ny(20) ytitle("Household equivalent income")

ny(20) tells Stata to build 20 groups or "bins" from the variable under investigation (the default is 35). The remaining options change the look of the graph.

Another interesting option is center, which centers the dots instead of aligning them on the left.
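
A minimal sketch of a centered dot plot, assuming Stata's shipped auto dataset as stand-in data (price takes the place of the income variable):

```stata
* Stand-in data (assumption): Stata's shipped auto dataset, variable price
sysuse auto, clear

* 20 bins; dots in each bin are centered rather than left-aligned
dotplot price, ny(20) center
```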

Note that command stripplot can be used to produce dot plots as well. For instance, the following command creates a dot plot that is analogous to the one just shown:

stripplot equivinc, stack width(500) vertical

Bivariate dot plots

Like most graphs, the dot plot may be used to compare several groups, as in:

dotplot equivinc, ny(20) over(education) xtitle(" " "Education")

Bivariate dot plot

Kernel density estimators

Only one variable can be specified (even though formally Stata defines this as a twoway graph). The density estimated and displayed may be influenced by the "bwidth" option, as in:

kdensity equivinc, bwidth(500)

Kernel density plot

The value to be chosen for the bandwidth, or smoothing parameter, depends on your data; you may have to play around a bit. Normally, the value chosen by Stata is a good starting point (actually it's quite near to 500 in my example).
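
One way to "play around" systematically is to look at the bandwidth Stata picked and then overlay a clearly narrower and a clearly wider one. The sketch below assumes Stata's shipped auto dataset as stand-in data; halving and doubling the default is an arbitrary choice made for illustration, and the local macro names are hypothetical.

```stata
* Stand-in data (assumption): Stata's shipped auto dataset, variable price
sysuse auto, clear

* Let Stata choose the bandwidth, then look at what it picked
kdensity price
local bw = r(bwidth)
display "Default bandwidth: `bw'"

* Overlay half and double that value to see under- and oversmoothing
local narrow = `bw' / 2
local wide   = `bw' * 2
twoway (kdensity price, bwidth(`narrow'))  ///
       (kdensity price, bwidth(`wide')),   ///
       legend(order(1 "Half the default" 2 "Twice the default"))
```

The narrow bandwidth typically produces a bumpy curve that follows individual data points, whereas the wide one smooths away much of the detail.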

Another issue is the kernel to be used. By default, Stata uses the Epanechnikov kernel, which can be shown to be the most efficient kernel in minimizing the mean integrated squared error when kdensity is used to estimate the underlying population distribution. Still, other kernels are sometimes used.

kdensity equivinc, kernel(biweight)

will use the biweight kernel. Other well-known kernels that are available include epan2 (a simplified version of the Epanechnikov kernel), gaussian, and triangle.
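
To judge how much the choice of kernel matters for your data, you can overlay several estimates in one graph. A minimal sketch, assuming Stata's shipped auto dataset as stand-in data; each curve uses the bandwidth Stata chooses by default for it.

```stata
* Stand-in data (assumption): Stata's shipped auto dataset, variable price
sysuse auto, clear

* Three kernels, one graph
twoway (kdensity price, kernel(epanechnikov)) ///
       (kdensity price, kernel(gaussian))     ///
       (kdensity price, kernel(biweight)),    ///
       legend(order(1 "Epanechnikov" 2 "Gaussian" 3 "Biweight"))
```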

Violin plots

A violin plot combines a box-and-whisker plot with a density estimator (for details see Hintze & Nelson 1998).

The violin plot is not part of the package you obtain from StataCorp; an ado file must be downloaded to your system:

ssc install vioplot

The following plot was created by

vioplot equivinc

plus some options.

Univariate violin plot

Bivariate violin plots

Like box plots, violin plots can be important for comparing groups, as in this example:

vioplot equivinc, over(education) ytitle("Household equivalent income") xtitle(" " "Education")

Bivariate violin plot

References

Cox, Nicholas J. (2004): Speaking Stata: Graphing distributions, The Stata Journal, 4 (1), pp. 66–88.
Hintze, Jerry L./Nelson, Ray D. (1998): Violin Plots: A Box Plot-Density Trace Synergism, The American Statistician, 52 (2), pp. 181-184, DOI: 10.1080/00031305.1998.10480559.
Jacoby, William G. (2006): The Dot Plot: A Graphical Display for Labeled Quantitative Values, The Political Methodologist. Newsletter of the Political Methodology Section, American Political Science Association, Vol. 14, Number 1, pp. 6-14.
Wilkinson, Leland (1999): Dot plots, The American Statistician, Vol. 53, No. 3, pp. 276-281.