Charts for Discrete Data

Bar charts are typically used to display the frequencies or the percentages of categories of a variable. The bars may be very slim (looking like spikes), even though this option is rarely used. Stata offers several ways to create bar charts, and these produce rather similar (though not identical) things if you are dealing with a single variable. Larger differences appear once you start dealing with two variables at the same time.

Before continuing, I want to direct your attention to Nicholas J. Cox's (2004) paper on graphs for categorical variables. You'll find interesting discussions and helpful examples there.

Please note: Most graphs in this entry have been created by using, among other options, the following: scheme(s1mono), plotregion(lstyle(none)), and occasionally ylabel(,angle(0)). These options are not repeated in the examples shown below.

Procedures graph bar and catplot

graph bar and catplot can both be used to create bar charts; they differ concerning the layout or design of the charts. First, a few words of introduction:

By default, Stata bar charts display the mean values of variables. In my view, such charts are not to be recommended (too much ink for a few numbers), and Cleveland dot plots should usually be preferred. However, when graph bar is used with option over() and without a variable to be summarized, it displays bar charts for percentages (or frequencies, should you so desire) of categories.

catplot was created for the very purpose of displaying frequencies (counts) and related statistics such as percentages or fractions. It is an ado file; use findit catplot to install this file on your computer.

Univariate charts

graph bar, over(nrooms) bar(1, bfcolor(white) blcolor(black)) b1title("Number of rooms")

is an example code that shows the distribution (in percentages) of the number of rooms of the respondents' flats or houses (from a small data set simulated after real data). Frequencies instead of percentages can be obtained with graph bar (count), which of course will not affect the look of the graph, only the labels on the vertical axis.

[Figure: bar chart]

You will get exactly the same plot if you replace graph bar, with catplot, percent vertical. In other words, thus far the main differences between graph bar and catplot are that (a) the plots created by the latter are horizontal by default and (b) catplot shows frequencies (counts) by default.

I have used several options (all of which can be used with catplot as well); the following may be noteworthy:
, b1title("Number of rooms") gives a name for the variable displayed. "b1" is shorthand for "bottom" (or "below") and "1st" title; in other words, you could add something else below the plot with "b2title". This is needed because , xtitle("...") will not work, as there is no x axis (or so you will be told by Stata).
, bar(1, bfcolor(white) blcolor(black)) sets the fill colour and the outline colour of the bars (by default, the lines delineating the bars are grey).
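The univariate commands discussed so far can be tried out without the data set used here. The following is only a sketch: it assumes Stata's bundled nlsw88 example data and uses its variable race in place of nrooms, and the graph names pct and cnt are arbitrary labels chosen for illustration.

```stata
* Sketch using Stata's bundled nlsw88 data (an assumption; not the data shown above)
sysuse nlsw88, clear

* Percentages of each category (the default when no variable is summarized)
graph bar, over(race) bar(1, bfcolor(white) blcolor(black)) b1title("Race") name(pct, replace)

* Same chart with counts on the vertical axis
graph bar (count), over(race) bar(1, bfcolor(white) blcolor(black)) b1title("Race") name(cnt, replace)
```

Option name() merely keeps both graphs in memory so that they can be compared; it is not needed otherwise.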

In addition, option , gap(#) may be used to change the width of the bars (by influencing the gaps between them). This is a suboption of over(), i.e., it has to be placed within the parentheses surrounding the variable name, as in over(nrooms, gap(#)). You'll have to experiment a little, but note that something like gap(5000) will transform the bars into spikes. (However, if you really want spikes, you may use spikeplot instead of graph bar in the first place.)
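To get a feeling for gap(), a few values can be compared directly. This is a sketch under the assumption that Stata's bundled nlsw88 data are used (variable race instead of nrooms); the value 300 is an arbitrary choice, and the graph names are hypothetical labels.

```stata
* Sketch using Stata's bundled nlsw88 data (an assumption; not the data shown above)
sysuse nlsw88, clear

* Default gaps between the bars
graph bar, over(race) name(gap_default, replace)

* Larger gaps, hence slimmer bars (300 is an arbitrary value to experiment with)
graph bar, over(race, gap(300)) name(gap_wide, replace)

* Spikes drawn directly, without enlarging the gaps
spikeplot race, name(spikes, replace)
```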

Further options are described in the other entries of the "Graphs" section, particularly those about titles, legends and notes or about axes.

In some cases (a larger number of categories, long variable labels) a horizontal bar chart may be preferable. As indicated, catplot can be deployed for this purpose. Alternatively, you may use graph bar, horizontal, with , l1title("...") for the description of the variable shown. Finally, you could use the command graph hbar (which needs no option , horizontal, although you may still want , l1title("...") in this case).
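The two horizontal variants of graph bar can be compared as follows. Again this is just a sketch, assuming Stata's bundled nlsw88 data; its variable occupation is used because it has many categories with fairly long labels.

```stata
* Sketch using Stata's bundled nlsw88 data (an assumption; not the data shown above)
sysuse nlsw88, clear

* Variant 1: graph bar with option horizontal
graph bar, over(occupation) horizontal l1title("Occupation") name(h1, replace)

* Variant 2: graph hbar, which is horizontal from the outset
graph hbar, over(occupation) l1title("Occupation") name(h2, replace)
```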

Bivariate data

The main difference between graph bar and catplot can be found here. catplot (or rather, its author) does not like legends, in contrast to graph bar. catplot is also not in favour of bars with different colours (which is related to the legends question). So, by default, graphs look somewhat different depending on which command you use.

The following examples show a variable called "worries", which measures how much respondents worry about their economic situation, conditional on the variable "education". We start with graph hbar:

graph hbar, over(worries) over(education) asyvars percent l1title(" " "Education") ytitle("Percent")

[Figure: bar chart]

I have used graph hbar for comparison with catplot. Be sure to include option asyvars.
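For readers who wish to reproduce this kind of chart, here is a sketch that assumes Stata's bundled nlsw88 data, with race standing in for "worries" and collgrad for "education". As far as I understand it, asyvars makes graph hbar treat the categories of the first over() variable as separate "variables", which is why each category gets its own colour and legend entry.

```stata
* Sketch using Stata's bundled nlsw88 data (an assumption; not the data shown above)
sysuse nlsw88, clear

* Distribution of race within each category of collgrad
graph hbar, over(race) over(collgrad) asyvars percent l1title(" " "College graduate") ytitle("Percent")
```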

And here is the result of

catplot , over(worries) over(education) percent(education) l1title(" " "Education") bar(1, bfcolor(white) blcolor(black))

[Figure: bar chart]

Note that I've written percent(education)! Writing only "percent" will yield a different graph -- one in which all bars (i.e. all categories of "worries" over all categories of "education") add up to 100 percent. The same result can be obtained with "graph hbar" if you omit "percent" from the list of options.

Finally note that you can add option stack both to "graph (h)bar" and to "catplot". Unsurprisingly, this will stack the bars on each other. In this case, legends cannot be avoided and the graphs will look similar no matter which command you use.
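The effect of stack, and of omitting percent, can be checked with graph hbar alone. The following sketch assumes Stata's bundled nlsw88 data (race and collgrad instead of "worries" and "education"); the graph names are hypothetical labels.

```stata
* Sketch using Stata's bundled nlsw88 data (an assumption; not the data shown above)
sysuse nlsw88, clear

* Percentages within each category of collgrad, stacked on each other
graph hbar, over(race) over(collgrad) asyvars percent stack name(stacked, replace)

* Without percent: all bars taken together add up to 100
graph hbar, over(race) over(collgrad) asyvars name(total, replace)
```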

Trivariate graphs

You may add a third variable with another "over(...)" option. Regardless of whether "graph hbar" or "catplot" is used, the results will look quite similar, and similarly convoluted. Alternatively, the third variable might be added via option by(...), which will result in one bivariate graph for each category of the third variable, placed side by side.
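Both approaches can be sketched with graph hbar. The example assumes Stata's bundled nlsw88 data, with married as the third variable; I have not checked how the percentages are scaled in every variant, so please inspect the axes before interpreting the result.

```stata
* Sketch using Stata's bundled nlsw88 data (an assumption; not the data shown above)
sysuse nlsw88, clear

* Third variable added via another over() option: everything in one panel
graph hbar, over(race) over(collgrad) over(married) asyvars percent name(three_over, replace)

* Third variable moved to by(): one panel per category of married
graph hbar, over(race) over(collgrad) asyvars percent by(married) name(by_married, replace)
```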

2010 version of catplot

For a few years from 2010 onward the syntax of "catplot" was somewhat different. In case you are using this older version, a few words might be helpful.

As an example,

catplot nrooms, percent recast(bar)

will produce something similar to the univariate bar graph shown above (without "recast(bar)" the bars will appear horizontally). The name or the label of the variable will be displayed automatically.

For bivariate plots, a second variable can be added to the list of variables. The options work similarly to what I have described above; for instance, the conditioning variable has to be added in parentheses to "percent" if you want the percentages to add up to 100 for each category of the conditioning variable.

Procedure spineplot

This command, available after ssc install spineplot, can create a plot like this:

[Figure: mosaic plot]

which was achieved with

spineplot worries education, percent bar1(bcolor(black*0.1)) bar2(bcolor(black*0.25)) bar3(bcolor(black*0.4))

Such a plot is called a mosaic plot, also known as a Marimekko plot (or so I've been told). As you can see, it differs from the classical stacked bar chart by representing, in addition, the distribution of the conditioning variable: as the middle category of variable "education" holds more than half of the cases, the middle column is considerably wider than the other two columns. Put differently, the area of each tile corresponds to the overall proportion of the cases with the respective combination of features. For instance, the tile in the top left corner corresponds to roughly .27 x .18 = .05 (or 5 per cent) of all cases.
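The numbers behind the tiles can be inspected with tabulate. This is a sketch assuming Stata's bundled nlsw88 data, with race playing the role of "worries" (rows) and collgrad the role of "education" (columns); the last line merely repeats the arithmetic from the example above.

```stata
* Sketch using Stata's bundled nlsw88 data (an assumption; not the data shown above)
sysuse nlsw88, clear

* Column percentages: how each column of the plot is divided into tiles
tabulate race collgrad, column nofreq

* Cell percentages: the area of each tile as a share of all cases
tabulate race collgrad, cell nofreq

* Column width times share within the column (roughly .05)
display .27 * .18
```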

For more information on this procedure, see Cox (2008).

Procedure tabplot

This is another procedure that you have to install separately. Use search tabplot and download the latest version. This procedure is of particular interest for bivariate or trivariate relationships, but it has something to offer for univariate display as well.

The main difference from similar procedures is that there is no scale against which to measure the height (or the length) of the bars. Invoking tabplot varname will only produce bars of different heights, which of course may be all you want (the relative height of the bars being what people will be most interested in). If you want more, i.e. numbers associated with the bars, they can be displayed beneath the bars, as in the following:

tabplot nrooms, percent showval( , format(%2.0f)) xtitle(" " "Number of rooms") subtitle(" ")

[Figure: bar chart]

Let me highlight two options: First, showval( , format(%2.0f)) will display the percentages underneath the bars, with the format part suppressing the decimal places (you can omit the parentheses after showval to obtain the default, i.e. numbers with one decimal place). Second, the subtitle option as used here suppresses the word "percent" which otherwise would appear on the top left of the chart. Of course, you may just as well choose a different subtitle by putting it between the double quotes, or omit "subtitle" entirely to display the default.

What if you want a different look of the bars? This is a bit tricky: you have to use option , sep(varname), which tells Stata that each bar of the variable under consideration is to be treated as a separate entity. Now you may add option barall() to change the look of all the bars (which is what you will normally want to do with a univariate diagram). For example, barall(bcolor(white) blcolor(black)) will yield bars like those in the graph bar example further above; I use this in the bivariate example that follows. If you wish, you may address each single bar by bar1(), bar2() ... up to bar20(). (In case you want to change the look of all the bars and at the same time highlight a particular bar, you might first use a barall() clause followed by bar#(), with # replaced by the number of the particular bar. For instance, bar6(bcolor(red)) would show the sixth bar in red.)
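Put together, a univariate command with restyled bars might look like the following sketch. It assumes that tabplot has been installed and that Stata's bundled nlsw88 data are used (race instead of nrooms); the option names are simply those described above, and highlighting the second bar is an arbitrary choice for illustration.

```stata
* Sketch using Stata's bundled nlsw88 data (an assumption; not the data shown above)
* Requires the community-contributed tabplot, e.g. via: ssc install tabplot
sysuse nlsw88, clear

* White bars with black outlines, with the second bar highlighted in red
tabplot race, percent sep(race) barall(bcolor(white) blcolor(black)) bar2(bcolor(red)) showval( , format(%2.0f)) subtitle(" ")
```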

Bivariate data

The chart created will be analogous to the mosaic plot shown above (see procedure spineplot), but with a different look.

tabplot worries education, sep(worries) barall(bcolor(white) blcolor(black)) showval(offset(0.08) format(%2.0f)) xtitle(" " "Education") subtitle("Frequency")

Here, I have chosen to display frequencies. As there are 99 cases, the counts are virtually identical to the percentages, so adding option percent would yield practically the same numbers, now representing percentages.

[Figure: a plot created by tabplot]

For a graph analogous to the stacked bar chart, you have to add percent(education) instead of percent only. The result will look like this, with the percentages adding up to 100 for each category of education:

[Figure: a plot created by tabplot]
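A command of this kind might be sketched as follows, assuming that tabplot has been installed and that Stata's bundled nlsw88 data are used, with race in place of "worries" and collgrad in place of "education".

```stata
* Sketch using Stata's bundled nlsw88 data (an assumption; not the data shown above)
* Requires the community-contributed tabplot, e.g. via: ssc install tabplot
sysuse nlsw88, clear

* Percentages add up to 100 within each category of collgrad
tabplot race collgrad, percent(collgrad) showval
```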

For more information on this procedure, see section 4.4 in Cox (2004).

Histograms for categorical (discrete) variables

Histograms are typically used for metric variables (see elsewhere), but a command such as

histogram nrooms, discrete gap(20) percent

will create a chart that is quite similar to a bar chart.

gap() controls the gap between the bars; as you may expect, some trial and error may be required here, depending on your data. Strictly speaking, the number in parentheses is the percentage by which the width of the bars is reduced.

While the start() option (described in the entry on histogram for metric variables) has no effect here as far as I can see, the width() option may be used for grouping categories. In my view, however, this would go against the 'logic' of discrete charts.
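The effects of gap() and width() can be compared with a sketch like the following, which assumes Stata's bundled nlsw88 data and its variable grade (completed years of schooling) in place of nrooms; the values 20, 60 and 2 are arbitrary, and the graph names are hypothetical labels.

```stata
* Sketch using Stata's bundled nlsw88 data (an assumption; not the data shown above)
sysuse nlsw88, clear

* Bar width reduced by 20 per cent
histogram grade, discrete percent gap(20) name(gap20, replace)

* Bar width reduced by 60 per cent: much slimmer bars
histogram grade, discrete percent gap(60) name(gap60, replace)

* width(2) pools neighbouring values of grade into one bar
histogram grade, discrete percent width(2) gap(20) name(grouped, replace)
```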

References

Cox, Nicholas J. (2004): Speaking Stata: Graphing categorical and compositional data, The Stata Journal, 4 (2), pp. 190–215.
Cox, Nicholas J. (2008): Speaking Stata: Spineplots and their kin, The Stata Journal, 8 (1), pp. 105–121.