Twoway (Bivariate) Charts

This section introduces some elementary possibilities for displaying bivariate relationships. Many graph commands that fall into this category start with twoway, but not all: commands for graphs that can also be used for univariate display (such as box plots) do not, and for some others (such as scatter plots) twoway may be omitted.

Note the other entries in this section, which contain important information about options for graphs. The final two entries are devoted to more complex graphs, where several elements are overlaid or where several graphs are combined in a single plot.


Box plots (box-and-whisker plots)

Box plots were already described in the "univariate charts" entry, but actually they are mainly used to compare distributions of two or more groups, as in:

graph box income, over(status)

Here is a more elaborate version that uses a number of options to create a look that I like.

graph box income, over(status) scheme(s1mono) intensity(0) aspectratio(2)  ///
   outergap(*3)  medtype(cline)  medline(lwidth(medium) lcolor(gs0) )      ///
   title("Income by social groups in 2002") ytitle("Income")               ///
   note("Source: XY data, own calculations")

Dot plots

Even more than in the case of box plots, the real use of dot plots lies in comparing the values of several groups, as in:

graph dot income, over(status)

Again, many options are available, among which one stands out in particular: sort(), which is a suboption of over(). Thus,

graph dot income, over(status, sort(income))

will show the category with the lowest income at the top of the graph and proceed to the higher categories. Adding desc[ending] will reverse the order. Note that the graph need not be sorted by any of the variables that appear in the graph; it may as well be sorted by some other variable.
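As a brief sketch of both variants (assuming, as in the examples of this entry, a data set that contains income, status and tenure; sorting by tenure is chosen only for illustration):

```stata
* Highest income on top: descending is a suboption of over()
graph dot income, over(status, sort(income) descending)

* Order the categories by a variable that is not shown in the graph
graph dot income, over(status, sort(tenure))
```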


Scatterplots

Some basics

scatter income tenure

This will plot income (y axis) against tenure (x axis). Note that this version is a shortcut for the full command graph twoway scatter income tenure. When overlaying twoway graphs, you will always have to start the command line either with graph twoway or at least with twoway.

There are many options specific to scatterplots. For instance,

scatter trustcourt trustpolit, ms(d)

(with ms as an abbreviation for msymbol) will use small "diamonds" instead of circles to display the data. A list of symbols available can be found via help symbolstyle. There are also fifteen predefined marker styles that define combinations of colour, symbols and so on, see help markerstyle. You can combine global marker styles with more specific styles. For instance, marker style "p4" displays symbols in a colour that looks beige to me, and it uses circles for symbols. If you want to use all the settings of style p4, but with diamonds instead of circles, you may write

scatter trustcourt trustpolit, mstyle(p4) ms(d)

See the entry on Lines, Symbols, etc. for more about this.

Note finally that you may add labels to the data points, which is helpful, e.g., if your data refer to a small number of cases (such as countries, types of car, etc.). A prerequisite is of course that there is a variable with the pertinent labels. This may either be a string variable or a numeric variable with value labels. The option is simply mlabel(varname). For more detail, see the entry on changing elements of graphs.
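A minimal sketch, assuming country-level data with a (hypothetical) variable named country that holds the labels; unempl and gdp are the variable names used further below in this entry:

```stata
* Label each data point with the content of variable country
* (country is a hypothetical variable name)
scatter unempl gdp, mlabel(country)
```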

Some variants

To add a (linear) regression line to the graph, you have to use the extended version of the command which starts with graph twoway. It goes like this:

graph twoway (lfit income tenure) (scatter income tenure)
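The same kind of overlay may also be written with the || separator instead of parentheses. This is just a sketch with the variables from above; the scatterplot is listed first so that the fitted line is drawn on top of the points.

```stata
* Scatterplot with a linear fit overlaid, using || to separate the two plots
twoway scatter income tenure || lfit income tenure
```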

To display a smoothed regression line, use the lowess command instead:

lowess income tenure

How closely the lowess curve follows the data is determined by the bandwidth. The default value (.8) can be changed with the bwidth(#) option, where # is any number between .1 and .99. The former will yield a graph that zigzags from one data point to the next. Actually, you can use other numbers as well, but these will default to .8 (with the exception of numbers larger than 0 and smaller than .1, which will result in no regression line at all).
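For instance, to obtain a curve that follows the data more closely than the default, you might try something like the following (the value .4 is an arbitrary choice for illustration):

```stata
* Lowess smoother with a narrower bandwidth than the default of .8
lowess income tenure, bwidth(.4)
```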

A jittered plot adds some random noise around each data point, with the value in parentheses referring to the size of the noise as percentage of the graphical area (some trial and error may be required here):

scatter trustcourt trustpolit, jitter(10)

Note that scatter does not automatically adapt the plot axes to the data range that is actually expanded by jittering; it will use the original range of the variables. Therefore you will have to extend the axes with the help of the xsc and ysc options.
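A sketch of how this might look, assuming (for illustration only) that both trust variables run from 0 to 10:

```stata
* Extend both axes a little beyond the assumed 0-10 range so that
* jittered points are not cut off at the edges
scatter trustcourt trustpolit, jitter(10) xscale(range(-1 11)) yscale(range(-1 11))
```

The amount by which the axes have to be extended depends on the jitter value, so again some trial and error may be required.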

Another way to deal with data points that would overlap in a simple scatterplot is a sunflower plot. Not surprisingly, it works like this:

sunflower trustcourt trustpolit

Note that Stata uses combinations of colours and petals to signal the density in a given area. This, and many other things, may again be influenced by various options.

Three (or more) variables

You may plot more than one y variable against the x variable, as in

scatter unempl atypjobs gdp

Here, variables "unempl" and "atypjobs" will be plotted against "gdp". By default, the two y variables will be distinguished by different colours, and a legend will be added to the plot.

When using options, you have to bear in mind that there are several variables. To select marker symbols, you will write, e.g.

scatter unempl atypjobs gdp, msymbol(d Oh)

The first symbol will refer to the data points that depict "unempl", the second to those that represent "atypjobs".


A matrix of scatterplots

graph matrix income tenure educ prestige

will create a matrix in which the four variables mentioned are each plotted against all others, with each variable appearing once on the x axis and once on the y axis. Options are available, e.g., to produce only the lower triangle of the matrix, to "jitter" the variables (add perturbations), and other stuff.
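The two options just mentioned could be combined as in the following sketch (the jitter value of 2 is an arbitrary choice):

```stata
* Lower triangle only, with a small amount of jitter
graph matrix income tenure educ prestige, half jitter(2)
```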

Note particularly the diag[onal] option which changes the labels appearing in the diagonal of the matrix. By default, it's the variable names that appear here, and to change them, write, e.g.:

graph matrix income tenure educ prestige, diag("Income" "Tenure" "Education" "Prestige")


Line plots

Preamble

One typical use of line plots is to show developments over time. In fact, as lines suggest that something "goes up" or "goes down" (or, of course, "goes up and down", perhaps repeatedly), and such developments occur in time, line plots should be used in other circumstances with great caution.

The development under investigation may concern all kinds of entities, but typically it will refer to units such as "countries", "towns", "schools", etc.

Organisation of data

A line plot is a twoway graph, which implies that you need two variables. The one that (typically) represents "time" will be represented on the x axis, the other on the y axis. (In fact, a line plot is basically the same as a scatterplot, the only difference being the line that connects the data points.) Thus, a very simple data set (most real data will contain more rows and columns) may look like this:

year  unempl
2008    5.6
2009    7.1
2010    8.3
2011   10.1

Note that time series data constitute a special type of data in Stata. They may be organized in the same way, but you can specifically "set" such data (i.e., give Stata some information about the data, e.g., whether the time series consists of days, months, years or other stuff). For such data, there is a specific command for creating plots, tsline, which will not be covered here.

Stata commands

A simple line plot for the data I have just displayed will go like this:

twoway line unempl year

There may (!) be circumstances in which you want to designate the data points by symbols (in addition to the lines connecting the data points). In this case, use the keyword connected, as in

twoway connected unempl year

Typical options that accompany these commands could be (among others):

lcolor()   colour of the line
lpattern()   the line pattern, obviously
msymbol()   the symbol that represents the data values
msize()   the size of said symbol
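To try these options out, the small data set shown above can be typed in directly. This is only a sketch; the particular colour, pattern and size are arbitrary choices for illustration.

```stata
* Enter the four observations shown above
clear
input year unempl
2008  5.6
2009  7.1
2010  8.3
2011 10.1
end

* Black dashed line, with large hollow circles marking the data points
twoway connected unempl year, lcolor(black) lpattern(dash) msymbol(Oh) msize(large)
```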

Note again that, as with scatter plots, you may also plot more than one y variable on line plots, and you will have to take care of this when using options. For instance,

twoway connected unempl atypjobs year, msymbol(Oh D)

will plot variables "unempl" and "atypjobs" against year, and the former variable will be represented by hollow circles, whereas diamonds will be used for the latter variable.
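The same logic applies to the other options: each takes one entry per y variable, in the order in which the variables are listed. In the following sketch, the legend texts are merely my assumptions about what the two variables stand for.

```stata
* One symbol and one line pattern per y variable, plus readable legend entries
twoway connected unempl atypjobs year, msymbol(Oh D) lpattern(solid dash) ///
   legend(label(1 "Unemployment") label(2 "Atypical jobs"))
```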


Twoway graphs for discrete (categorical) data

Catplot and spineplot

Percentages of a variable conditional on the values of a second (or even a third) variable may be displayed with the help of catplot (an ado file; see the entry on univariate charts). An alternative is spineplot, which I will perhaps explain at a later stage. A twoway graph with catplot works like this:

catplot note sex, percent(sex) recast(bar)

To produce a stacked bar chart, you have to add the options asyvars stack.

Particularly if percentages are conditioned on more than one variable, the labels may be too large (in relation to their number) and will overlap. In this case, try the following option: var2opts(label(labsize(small))) (or even vsmall).
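Put together with the example from above, the two variants just described might look like this (a sketch only; as catplot is user-written, it has to be installed first):

```stata
* catplot is user-written; if it is not yet installed: ssc install catplot

* Stacked bar chart
catplot note sex, percent(sex) recast(bar) asyvars stack

* Smaller labels for the second variable
catplot note sex, percent(sex) recast(bar) var2opts(label(labsize(small)))
```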

Twoway histogram for discrete (categorical) data

This is something very specific, and I do not recommend it. But if you wish to try it, here we go.

histogram cww151b, discrete by(cww151a, total)

In this example, the distribution of cww151b is graphed for each category of cww151a. If the distribution of the 'by' variable (in this example, cww151a) is very uneven, you may wish to create a graph that displays the absolute frequencies and thus allows you to judge the contribution of each category to the total:

twoway histogram cww151b, discrete freq by(cww151a, total)

© W. Ludwig-Mayerhofer, Stata Guide | Last update: 11 Jun 2019