Charts for Discrete Data

Bar charts are typically used to display the frequencies or the percentages of categories of a variable. The bars may be very slim (looking like spikes), even though this option is rarely used. Stata offers several ways to create bar charts, and these produce rather similar (though not identical) results if you are dealing with a single variable. Larger differences appear once you start dealing with two variables at the same time.

Before continuing, I want to direct your attention to Nicholas J. Cox's (2004) paper on graphs for categorical variables. You'll find interesting discussions and helpful examples there.

Please note: Most graphs in this entry have been created by using, among other options, the following: scheme(s1mono), plotregion(lstyle(none)), and occasionally ylabel(,angle(0)). These options are not repeated in the examples shown below.

Procedure graph bar

By default, Stata uses bar charts to show the mean values of variables. In my view, such charts cannot be recommended (too much ink for a few numbers), and Cleveland dot plots should usually be preferred. However, when graph bar is used together with option over() and without naming a variable to be summarized, bar charts for percentages (or frequencies, should you so desire) of categories will be displayed.

graph bar, over(nrooms) bar(1, bfcolor(white) blcolor(black)) b1title("Number of rooms")

is an example that shows the distribution (in percentages) of the number of rooms of the respondents' flats or houses (from a small data set simulated after real data). Frequencies instead of percentages can be obtained with graph bar (count), which of course will not affect the look of the graph, only the labels on the vertical axis.
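To try both variants without the data set used in this entry, here is a minimal sketch. It assumes the auto data set that ships with Stata as a stand-in, with rep78 (repair record) playing the role of nrooms; the graph names pct and freq are arbitrary.

```stata
* Assumption: auto.dta stands in for the data used in this entry;
* rep78 (five categories) plays the role of nrooms.
sysuse auto, clear

* Percentages (what graph bar shows when only over() is given)
graph bar, over(rep78) bar(1, bfcolor(white) blcolor(black)) ///
    b1title("Repair record 1978") name(pct, replace)

* Frequencies: same bars, different labels on the vertical axis
graph bar (count), over(rep78) bar(1, bfcolor(white) blcolor(black)) ///
    b1title("Repair record 1978") name(freq, replace)
```

Cases with a missing value on rep78 are left out of both charts unless you ask for them explicitly.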

Bar chart

I have used several options in creating this graph; the following two may be noteworthy:
, b1title("Number of rooms") to display a description of the variable displayed (, xtitle("...") will not work, as there is no x axis, or so you will be told by Stata);
, bar(1, bfcolor(white) blcolor(black)) to address the bar filling colour and the bar line colour (by default, the lines delineating the bars are in grey).

In addition, option , gap(#) may be used to change the width of the bars (by influencing the gaps between those). This is a suboption for over(), i.e., it has to be placed within the parentheses surrounding the variable name, as in over(nrooms, gap(#)). You'll have to experiment a little, but note that something like gap(5000) will transform the bars into spikes. (However, if you really want spikes, you may use spikeplot instead of graph bar in the first place.)
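A small sketch of the two routes to slimmer bars, again assuming the auto data set and rep78 as stand-ins; the value 200 is just a starting point for experimenting.

```stata
* Assumption: auto.dta and rep78 are stand-ins for the data in this entry.
sysuse auto, clear

* Wider gaps, hence slimmer bars; gap() is a suboption of over()
graph bar, over(rep78, gap(200)) name(slim, replace)

* Real spikes, drawn by a command made for the purpose
spikeplot rep78, name(spikes, replace)
```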

In some cases (a larger number of categories, long variable labels) a horizontal bar chart may be preferable. Use option , horizontal to get this type of bar chart and , l1title("...") for the description of the variable shown. Alternatively, use command graph hbar, omitting option , horizontal (you may also want , l1title("...") in this case).
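Both spellings of the horizontal chart might look as follows; this is a sketch on the auto data set (an assumption, as are the title text and the graph names).

```stata
* Assumption: auto.dta and rep78 are stand-ins for the data in this entry.
sysuse auto, clear

* Variant 1: graph bar with option horizontal
graph bar, over(rep78) horizontal l1title("Repair record 1978") ///
    name(horiz1, replace)

* Variant 2: graph hbar, no further option needed
graph hbar, over(rep78) l1title("Repair record 1978") ///
    name(horiz2, replace)
```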

Further options are described in the other entries of the "Graphs" section, particularly those about titles, legends and notes or about axes.

Bivariate data

graph bar, over(worries) over(education, gap(100)) asyvars percent

will create (vertical) bars showing the distribution of variable "worries" (e.g., worries about the respondent's economic situation) conditional on variable "education". Be sure to include option asyvars. Adding option stack unsurprisingly will stack the bars on each other. Using graph hbar instead of graph bar, or option horizontal, will display the bars horizontally. Suboption , gap(100) is used in this example to render the bars slightly slimmer by increasing the gaps between them; it can be omitted, of course.
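The side-by-side and the stacked version can be compared with a sketch like the following. It assumes the auto data set, with rep78 in the role of "worries" and foreign in the role of "education"; the graph names are arbitrary.

```stata
* Assumption: auto.dta stands in for the data used in this entry;
* rep78 plays the role of worries, foreign that of education.
sysuse auto, clear

* Distribution of rep78 within each category of foreign, bars side by side
graph bar, over(rep78) over(foreign, gap(100)) asyvars percent ///
    name(sidebyside, replace)

* The same distribution, with the bars stacked on each other
graph bar, over(rep78) over(foreign, gap(100)) asyvars percent stack ///
    name(stacked, replace)
```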

Procedure catplot

catplot is an ado file which has to be downloaded. Use findit catplot to install this file on your computer.

catplot nrooms, percent recast(bar)

will achieve the same result as the graph bar command outlined above. However, the name or the label of the variable displayed is used as a title for the x axis.

catplot can also be used to produce two-dimensional graphs. However, with graph bar, the bars of the conditional distributions will add up to 100 per cent for each category of the conditioning variable. In contrast, with command catplot, by default all bars taken together will add up to 100 per cent. To change this and obtain the same display as with graph bar, you have to name the conditioning variable with the percent option, as in percent(education). Also, the default for catplot is horizontal bars; this can be changed with option vertical or with recast(bar) as shown above. Otherwise, the command line can be built exactly as for graph bar, including option stack for a stacked bar chart.
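The difference between the two ways of computing percentages can be seen in a sketch like this one. It assumes that catplot is installed and uses the auto data set as a stand-in (rep78 for "worries", foreign for "education"); the graph names are arbitrary.

```stata
* ssc install catplot      // run once if catplot is not installed yet
* Assumption: auto.dta stands in for the data used in this entry.
sysuse auto, clear

* Default: all bars taken together add up to 100 per cent
catplot rep78 foreign, percent recast(bar) name(joint, replace)

* Percentages computed separately within each category of foreign
catplot rep78 foreign, percent(foreign) recast(bar) name(within, replace)
```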

Procedure spineplot

This command, available after ssc install spineplot, can create a plot like this:

Mosaic plot

which was achieved with

spineplot sorgwirt bildgr, percent bar1(bcolor(black*0.1)) bar2(bcolor(black*0.25)) bar3(bcolor(black*0.4))

Such a plot is called a mosaic plot, also known as a Marimekko plot (or so I've been told). As you can see, it differs from the classical stacked bar chart by representing, in addition, the distribution of the conditioning variable: As the middle category of variable "education" holds more than half of the cases, the middle column is considerably wider than the other two columns. Put differently, the area of each tile corresponds to the overall proportion of the cases with the respective combination of features. For instance, the tile in the top left corner corresponds to roughly .27 × .18 ≈ .05 (or 5 per cent) of all cases.
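The numbers behind the tiles can be checked with tabulate. The following sketch assumes that spineplot is installed and uses the auto data set as a stand-in (rep78 as the variable shown, foreign as the conditioning variable), so the figures will differ from those quoted above.

```stata
* ssc install spineplot    // run once if spineplot is not installed yet
* Assumption: auto.dta stands in for the data used in this entry.
sysuse auto, clear

* Column widths: distribution of the conditioning variable
tabulate foreign if !missing(rep78)

* Tile heights: distribution of rep78 within each category of foreign
tabulate rep78 foreign, column nofreq

* Tile areas: width times height, i.e. the cell percentages
tabulate rep78 foreign, cell nofreq

spineplot rep78 foreign, percent
```

Multiplying a column percentage from the second table by the corresponding share from the first reproduces the cell percentage in the third.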

For more information on this procedure, see Cox (2008).

Procedure tabplot

Another procedure you have to install separately. Use search tabplot and download the latest version. This procedure is of interest particularly for bivariate or trivariate relationships, but it has something to offer for univariate display as well.

The main difference from similar procedures is that there is no scale against which to measure the height (or the length) of the bars. Invoking tabplot varname will only produce bars of different height, which of course may be all you want (the relative height of the bars being what people will be most interested in). If you want more, i.e. numbers associated with the bars, they can be displayed beneath the bars, as in the following:

tabplot nrooms, percent showval( , format(%2.0f)) xtitle(" " "Number of rooms") subtitle(" ")

Bar chart

Let me highlight two options: First, showval( , format(%2.0f)) will display the percentages underneath the bars, with the format part suppressing the decimal places (you can omit the parentheses after showval to obtain the default, i.e. numbers with one decimal place). Second, the subtitle option as used here suppresses the word "percent" which otherwise would appear on the top left of the chart. Of course, you may just as well choose a different subtitle by putting it between the double quotes, or omit "subtitle" entirely to display the default.
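The effect of the two options can be compared with a sketch like this. It assumes that tabplot is installed and uses the auto data set with rep78 as a stand-in for nrooms; the graph names are arbitrary.

```stata
* ssc install tabplot      // run once if tabplot is not installed yet
* Assumption: auto.dta and rep78 are stand-ins for the data in this entry.
sysuse auto, clear

* Defaults: values shown with one decimal place, default subtitle
tabplot rep78, percent showval name(onedec, replace)

* No decimal places, and the default subtitle replaced by a blank
tabplot rep78, percent showval(format(%2.0f)) subtitle(" ") ///
    name(nodec, replace)
```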

What if you want a different look of the bars? Achieving this is a bit tricky: You have to use option , sep(varname) which tells Stata that each bar of the variable under consideration is to be treated as a separate entity. Now, you may add option barall() to change the look of all the bars (which is what you will normally want to do with a univariate diagram). For example, barall(bcolor(white) blcolor(black)) will yield bars like those in the graph bar example further above; I use this in the bivariate example that follows. If you wish, you may address each single bar by bar1(), bar2() ... up to bar20(). (In case you want to change the look of all the bars and at the same time highlight a particular bar, you might use first a barall() clause followed by bar#(), with # replaced by the number of the particular bar. For instance, bar6(bcolor(red)) would show the sixth bar in red.)
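Combining barall() with a single bar#() might look as follows. This is a sketch under the assumption that tabplot is installed, using the auto data set and rep78 as stand-ins; I write the option as separate(), which I take to be the unabbreviated form of sep().

```stata
* Assumption: tabplot is installed; auto.dta and rep78 are stand-ins.
sysuse auto, clear

* All bars white with black outlines, third bar highlighted in red
tabplot rep78, percent separate(rep78) ///
    barall(bcolor(white) blcolor(black)) bar3(bcolor(red)) ///
    name(highlight, replace)
```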

Bivariate data

With tabplot, you can create a chart analogous to the mosaic plot shown above (created with spineplot), but with a different look.

tabplot worries education, sep(worries) barall(bcolor(white) blcolor(black)) showval(offset(0.08) format(%2.0f)) xtitle(" " "Education") subtitle("Frequency")

Here, I have chosen to display frequencies; however, the numbers are virtually the same, as there are 99 cases, and by adding option percent we would get the same display with the numbers representing percentages.

A plot created by tabplot

For a graph analogous to the stacked bar chart, you have to add percent(education) instead of percent only. The result will look like this, with the percentages adding up to 100 for each category of education:
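A sketch of the conditional version, assuming that tabplot is installed and using the auto data set as a stand-in (rep78 for "worries", foreign for "education"):

```stata
* Assumption: tabplot is installed; auto.dta stands in for the data used here.
sysuse auto, clear

* Percentages add up to 100 within each category of foreign
tabplot rep78 foreign, percent(foreign) separate(rep78) ///
    barall(bcolor(white) blcolor(black)) ///
    showval(offset(0.08) format(%2.0f)) name(conditional, replace)

* The same numbers as a table, for checking
tabulate rep78 foreign, column nofreq
```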

A plot created by tabplot

For more information on this procedure, see section 4.4 of Cox (2004).

Histograms for categorical (discrete) variables

Histograms are typically used for metric variables (see elsewhere), but a command such as

histogram nrooms, discrete gap(20) percent

will create a chart that is quite similar to a bar chart.

gap refers to the gap between the bars; as you may expect, some trial and error may be required here, depending on your data. Actually, the number in parentheses indicates the percentage by which the width of the bars will be reduced.

While the start() option (described in the entry on histogram for metric variables) has no effect here as far as I can see, the width() option may be used for grouping categories. In my view, however, this would go against the 'logic' of discrete charts.
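To see what gap() and width() do, a sketch like the following may help. It assumes the auto data set with rep78 as a stand-in for nrooms; the graph names are arbitrary.

```stata
* Assumption: auto.dta and rep78 are stand-ins for the data in this entry.
sysuse auto, clear

* One bar per value, each bar narrowed by 20 per cent
histogram rep78, discrete percent gap(20) name(single, replace)

* width(2) lumps neighbouring values into one bar
histogram rep78, discrete percent gap(20) width(2) name(grouped, replace)
```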

References

  • Cox, Nicholas J. (2004): Speaking Stata: Graphing categorical and compositional data, The Stata Journal, 4 (2), pp. 190–215.
  • Cox, Nicholas J. (2008): Speaking Stata: Spineplots and their kin, The Stata Journal, 8 (1), pp. 105–121.

© W. Ludwig-Mayerhofer, Stata Guide | Last update: 21 Apr 2025