WLM Stata - Charts

Charts and Graphs

A number of graphic displays are available. By and by, this section will include some of those that seem most useful to me. Note the other entries in this section, which contain important information about options for graphs. The final two sections are devoted to more complex graphs, where several elements are overlaid or where several graphs are combined in a single plot.

Univariate graphs

Bar charts

Bar charts are typically used to display the frequencies or the percentages of categories of a variable.

Procedure graph bar

By default, Stata uses bar charts to show the mean values of variables. However, when graph bar is used together with the over() option and without a variable list, the percentages (or frequencies, for that matter) of the categories will be displayed.

graph bar, over(education)

will show the percentage of each category of variable "education". For frequencies, use

graph bar (count), over(education)

In some cases (many categories, long variable labels) a horizontal bar chart may be preferable. Use option horizontal.

As an aside, a drawback of the graph bar command is that the name of the variable is not displayed on the x axis. To display it, you have to use special options for axis titles, to wit, b1title("Any title you want") for vertical bar charts or l1title("Any title you want") in the case of horizontal charts.
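
A minimal sketch of these options, assuming Stata 13 or later and using the nlsw88 example data shipped with Stata (my choice for illustration; the variables race and occupation belong to that file, not to the examples in this guide). I use graph hbar here, which is the horizontal form of graph bar:

```stata
* Example data shipped with Stata (an assumption; any categorical variable will do)
sysuse nlsw88, clear

* Vertical bar chart of percentages, with the variable name added below the x axis
graph bar, over(race) b1title("Race") ytitle("Percent")

* Horizontal bar chart of frequencies, with the variable name added at the left
graph hbar (count), over(occupation) l1title("Occupation") ytitle("Frequency")
```

With many categories, as in occupation, the horizontal version keeps the category labels readable.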

Procedure histogram

See below (subsection on histograms) for using this procedure to display values of categorical variables.

Procedure catplot

catplot is an ado file which has to be downloaded. Use findit catplot for downloading this file to your computer.

catplot education, percent recast(bar)

will achieve the same result as the graph bar command outlined above. However, the name or the label of the variable displayed is used as a title for the x axis. Note that catplot can also be used to produce two-dimensional graphs.

Cumulative distribution functions

There are various methods to obtain cdfs of empirical variables; each has its merits and its drawbacks. I discovered Adrian Mander's cdfplot only recently; this should do in most cases. But I have not yet been able to explore it in all detail.

Using cdfplot

Procedure cdfplot is most apt for categorical (ordered) variables, but in principle, you may use it for any type of variable you like. It uses a step function to connect the values of the c.d.f.

First, install the ado file:

ssc install cdfplot

Then you may obtain a cdf plot:

cdfplot education

You may compare groups:

cdfplot education, by(sex)

Further possibilities (see help cdfplot): You may include, by way of comparison, the c.d.f. of a normally distributed variable, and there are some further options.
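
A short sketch of both uses, again with the nlsw88 example data shipped with Stata (an assumption for illustration). The option name normal is how I understand help cdfplot; please check it against the version you have installed:

```stata
* cdfplot is user-written; install it once from SSC
ssc install cdfplot

sysuse nlsw88, clear

* Empirical c.d.f. of hourly wage, one step function per group
cdfplot wage, by(union)

* Same variable with a normal c.d.f. of equal mean and s.d. overlaid
* (option name as I read it in help cdfplot; verify for your version)
cdfplot wage, normal
```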

Using distplot

distplot is part of a package written by Nicholas J. Cox. Type findit distplot, look for the most recent entry on "Software update for distplot" and follow the link provided. Note that in earlier versions the syntax was somewhat different.

distplot line income

With categorical variables, you may wish to display the distribution as a step function, which can be achieved as follows:

distplot line education, connect(stairstep)

By default, the cdf is computed and displayed in terms of fractions. To obtain percentages, you have to use:

distplot line education, connect(stairstep) trscale(100 * @)

trscale stands for "transform scale". The "at" sign (@) stands for the original values, i.e. the fractions; you may use other transformations as well.

distplot may also be used to compare the cdf's of two or more groups as follows:

distplot line income, by(groupd) ytitle("cdf")

will display the cdf of income for the groups defined by variable "groupd" (of course, further options may be applied).

Kernel density estimators

kdensity trust

Only one variable can be specified. The look of the graph can be influenced by the "bwidth" option, as in:

kdensity trust, bwidth(.5)

The value to be chosen for the bandwidth, or smoothing parameter, depends on your data; you may have to play around a bit. Normally, the value chosen by Stata is a good starting point.

Another issue is the kernel to be used. By default, Stata uses the Epanechnikov kernel, which from a theoretical viewpoint can be shown to minimize the (mean integrated squared) error when using kdensity to estimate the underlying population distribution. Yet, sometimes other kernels may be used.

kdensity trust, kernel(biweight)

will use the biweight kernel. Some other well-known kernels that are available are epan2 (a simplified version of the Epanechnikov kernel), gaussian or triangle.
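
To see how much the bandwidth matters, it can help to overlay several estimates. The following is a sketch only, using the nlsw88 example data shipped with Stata (an assumption for illustration); the local macro names bw, half and twice are mine. It must be run as a whole from a do-file, since local macros do not survive between separate interactive commands:

```stata
sysuse nlsw88, clear

* Run kdensity once without drawing, just to retrieve Stata's default bandwidth
quietly kdensity wage, nograph
local bw = r(bwidth)
display "Default bandwidth: " `bw'

* Overlay estimates with half, default, and twice the default bandwidth
local half  = `bw' / 2
local twice = `bw' * 2
twoway (kdensity wage, bwidth(`half'))   ///
       (kdensity wage)                   ///
       (kdensity wage, bwidth(`twice')), ///
       legend(order(1 "half default" 2 "default" 3 "twice default")) ///
       ytitle("Density")
```

A smaller bandwidth follows the data more closely and produces a bumpier curve; a larger one smooths more.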

Histograms

histogram income

will display a histogram of, yes, "income".

You can obtain some control over your histogram with the following options:

histogram income, bin(20) start(0)

histogram income, width(500) start(0)

Option bin refers to the number of, well, bins, and start to the value at which the computation of the width of the bins will start (it will end automatically at or near the highest value encountered). As an alternative to the number of bins, you may indicate the width of each bin as in the second example.

Further options refer to the y-axis. By default, it will be drawn as density. You may also choose any of the following (the options are self-explanatory; the letters in square brackets may be omitted, and the brackets themselves are not typed):

histogram income, frac[tion]

histogram income, percent

histogram income, freq[uency]

Finally, you may add densities to the graph as follows:

histogram income, norm[al]

histogram income, kden[sity]

There are a few further options, particularly with respect to the display of the lines of the normal density or the kernel density estimate, which would lead us astray if explained at length here.
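
As a brief sketch of how these options combine, here are two variants using the nlsw88 example data shipped with Stata (an assumption for illustration; wage is hourly wage in that file):

```stata
sysuse nlsw88, clear

* Bins of width 5 starting at 0, y axis in percent, normal curve overlaid
histogram wage, width(5) start(0) percent normal

* Twenty bins, y axis as frequencies, kernel density estimate overlaid
histogram wage, bin(20) frequency kdensity
```

Note that bin() and width() are alternatives; you specify one or the other, not both.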

As an alternative to bar charts (see above), the histogram can be used to display the distribution of categorical variables as well. With such variables, the basic command typically goes like this:

histogram trust_pol, discrete gap(20) fract

gap refers to the gap between the bars; as you may expect some trial and error may be required here depending on your data. Actually, the number in parentheses indicates that the width of the bars will be reduced by so-and-so-many per cent.

Twoway (bivariate) graphs

Many graph commands that fall into this category start with twoway, but some referring to graphs that also can be univariate (such as box plots) don't, and in the case of some others (such as scatter plots), twoway may be omitted.

Box plots (box-and-whisker plots)

graph box income, over(status)

will produce box plots of income for the various status groups. Obviously, over() is an option which can be omitted; in this case, the box plot will be produced for all cases (and thus will be a univariate plot).

graph box income1998 income2000 income2002 income2004, cw

will produce box plots of income in the sample over several years. Note the cw, or casewise (deletion), option used here which causes Stata to use only those cases with valid values for all variables.

graph box income, over(status) intensity(0) aspectratio(2) outergap(*3) medtype(cline) medline(lwidth(medium) lcolor(gs0) ) ytitle("Income") scheme(s1mono) title("Income by social groups in 2002") note("Source: XY data, own calculations")

uses a number of options to create a look that I like.

Scatterplots

scatter income tenure

This will plot income (y axis) against tenure (x axis).

There are many options specific to scatterplots. For instance,

scatter trustcourt trustpolit, ms(d)

will use small "diamonds" instead of circles to display the data. A list of symbols available can be found via help symbolstyle. There are also fifteen predefined marker styles that define combinations of colour, symbols and so on, see help markerstyle. You can combine global marker styles with more specific styles. For instance, marker style "p4" displays symbols in a colour that looks beige to me, and it uses circles for symbols. If you want to use all the settings of style p4, but with diamonds instead of circles, you may write

scatter trustcourt trustpolit, mstyle(p4) ms(d)

To fit a regression line, you have to use the extended version of the command which starts with graph twoway. It goes like this:

graph twoway (lfit income tenure) (scatter income tenure)

A jittered plot adds some random noise around each data point, with the value in parentheses referring to the size of the noise as percentage of the graphical area (some trial and error may be required here):

scatter trustcourt trustpolit, jitter(10)

Note that scatter does not automatically adapt the plot axes to the data range that is actually expanded by jittering; it will use the original range of the variables. Therefore you will have to extend the axes with the help of the xsc and ysc options.
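
A sketch of jittering with extended axes, using the auto example data shipped with Stata (an assumption for illustration). The axis ranges are simply my guess at a sensible margin around the original values; jitterseed() makes the random noise reproducible:

```stata
sysuse auto, clear

* rep78 (1 to 5) and foreign (0/1) are discrete, so points overlap heavily.
* The axis ranges are widened by hand to leave room for the jittered points.
scatter rep78 foreign, jitter(5) jitterseed(12345) ///
    xscale(range(-0.5 1.5)) yscale(range(0.5 5.5)) ///
    xlabel(0 1, valuelabel)
```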

Another way to deal with data points that would overlap in a simple scatterplot is a sunflower plot. Not surprisingly, it works like this:

sunflower trustcourt trustpolit

Note that Stata uses combinations of colours and petals to signal the density in a given area. This, and many other things, may again be influenced by various options.

A matrix of scatterplots

graph matrix income tenure educ prestige

will create a matrix in which the four variables mentioned are each plotted against all others, with each variable appearing once on the x axis and once on the y axis. Options are available, e.g. to produce only the lower triangle of the matrix, to "jitter" the variables (add perturbations), and other stuff.
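
For instance, the lower triangle with a little jittering can be sketched as follows, using the auto example data shipped with Stata (an assumption for illustration):

```stata
sysuse auto, clear

* half draws only the lower triangle; jitter(2) adds a small amount of noise
graph matrix price mpg weight length, half jitter(2)
```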

Twoway graphs for discrete (categorical) data

Percentages of a variable conditional on the values of a second variable may be displayed with the help of catplot (see above; an alternative not explained here is spineplot). It works like this:

catplot note sex, percent(sex) recast(bar)

To produce a stacked bar chart, you have to add the options asyvars stack.
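
A sketch of both variants, using the nlsw88 example data shipped with Stata (an assumption for illustration; union and race are variables in that file). As far as I can tell, the first variable named is the one that gets stacked within the categories of the second, but you should check this against your own output:

```stata
* catplot is user-written; install it once from SSC
ssc install catplot

sysuse nlsw88, clear

* Percentages of union membership within each race category, bars side by side
catplot union race, percent(race) recast(bar)

* The same percentages, stacked into one bar per race category
catplot union race, percent(race) recast(bar) asyvars stack
```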

Twoway histogram for discrete (categorical) data

This is something special which I do not recommend. But if you wish to try it, here we go.

histogram cww151b, discrete by(cww151a, total)

In this example, the distribution of cww151b is graphed for each category of cww151a. If the distribution of the 'by' variable (in this example, cww151a) is very uneven, you may wish to create a graph that displays the absolute frequencies and thus allows you to judge the contribution of each category to the total:

twoway histogram cww151b, discrete freq by(cww151a, total)

Internet Guide to Stata
