Scatterplots and Line Plots

This section introduces some elementary possibilities for displaying bivariate relationships of metric variables. Line plots are mostly used when one of these variables is time, in whatever shape, as lines typically are decoded as showing developments over time.

Please note: Most graphs in this entry have been created by using options scheme(s1mono) and plotregion(lstyle(none)). These are not repeated in the examples shown below. ylabel(,angle(0)) might also have been employed usefully, but hasn't.

The basic scatterplot

scatter income age, xtitle(" " "Age") ytitle("Pre-tax Income" " ")

This will plot income (y axis) against age (x axis). Note that this version is a shortcut for the full command graph twoway scatter income age. When overlaying twoway graphs, you will always have to start the command line either with graph twoway or at least with twoway.


There are many useful options for scatterplots. For instance, ms(d) (with ms as an abbreviation for msymbol) will use small "diamonds" instead of circles to display the data. A list of available symbols can be found via help symbolstyle. There are also fifteen predefined marker styles that define combinations of colour, symbol and so on; see help markerstyle. You can combine global marker styles with more specific styles. For instance, marker style "p4" displays symbols in a colour that looks beige to me, and it uses circles for symbols. If you want to use all the settings of style p4, but with diamonds instead of circles, you may use mstyle(p4) ms(d). See the entry on Lines, Symbols, etc. for more about this.
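
To try the combination of a global marker style with a specific symbol, here is a minimal sketch. It assumes the auto data set that ships with Stata; the variables price and weight are merely stand-ins for the income and age variables used in this entry.

```stata
* Assumption: Stata's bundled auto data replace the income/age data of the text
sysuse auto, clear

* All settings of marker style p4, but small diamonds instead of circles
scatter price weight, mstyle(p4) msymbol(d)
```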

Note finally that you may add labels to the data points, which may be helpful, e.g., if your data refer to a small number of identifiable cases (such as countries, types of car, etc). A prerequisite is of course a variable containing the pertinent labels, which may either be a string variable or a numeric variable with labels. The option is simply mlabel(name-of-identifying-variable). For more detail, see the entry on changing elements of graphs.
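
A short sketch of labelled data points, again assuming the bundled auto data rather than the data used in this entry. The string variable make serves as the identifying variable; the restriction to foreign cars is only there to keep the number of labels manageable.

```stata
* Assumption: auto data; make is a string variable holding the car model
sysuse auto, clear

* Label each point with the content of make (foreign cars only, to limit clutter)
scatter mpg weight if foreign, mlabel(make)
```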

Some variants of the scatterplot

Adding a fitted regression line

To add a (linear) regression line to the graph, you have to use the extended version of the command which starts with graph twoway. It goes like this:

graph twoway (lfit income age) (scatter income age), legend(off)
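
If you wish to try an overlay of this kind without the income data, the following sketch may serve. It assumes the bundled auto data, with mpg and weight as stand-in variables. Both notations for overlaying twoway plots are shown: parentheses around each plot, and || as a separator.

```stata
* Assumption: auto data; mpg and weight stand in for income and age
sysuse auto, clear

* Each plot in its own pair of parentheses; options for the whole graph follow the comma
twoway (scatter mpg weight) (lfit mpg weight), legend(off)

* The same graph, with || separating the plots
twoway scatter mpg weight || lfit mpg weight, legend(off)
```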


Adding a locally fitted (and smoothed) regression line: lowess

To create a regression line that follows the data points more closely than a straight line, Stata offers the lowess command. This computes locally weighted regressions and smooths the resulting line:

lowess income age, bw(.8) xtitle(" " "Age") ytitle("Pre-tax Income" " ") title(" ")

Scatter plot with lowess smoother

How closely the lowess plot follows the data is defined by the bandwidth. The default value (.8) can be changed with the bwidth(#) option, where # is any number between a very small positive number (the exact size of which depends on the number of cases) and .99. Of course it wouldn't have been necessary to state the default explicitly; I did so only to draw your attention to this option.
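
To see what the bandwidth does, you may overlay two lowess curves on the same scatterplot. This is just a sketch: it assumes the bundled auto data, and the second bandwidth (.3) is an arbitrary value chosen for contrast.

```stata
* Assumption: auto data; .3 is an arbitrary smaller bandwidth for comparison
sysuse auto, clear

* Plot 2 uses the default bandwidth, plot 3 follows the data more closely
twoway (scatter mpg weight)                 ///
       (lowess mpg weight, bwidth(.8))      ///
       (lowess mpg weight, bwidth(.3)),     ///
       legend(order(2 "bwidth(.8)" 3 "bwidth(.3)"))
```

Because of the /// line continuations, this has to be run from a do-file (or the do-file editor), not typed into the command window.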

running: An alternative to lowess

As an alternative to lowess, you could use the running command. This has the advantage that confidence intervals are available. You can install this ado file from SSC, or you can type search mrunning and follow the download link for the package; this will install running together with its multivariate companion, mrunning.

running income age, ci

plus some options for axis titles (which you may have seen a couple of times by now) will yield the following. Among other things, you can see that one shouldn't take the increase in income when people approach 65 too seriously; it's caused mainly by a single data point, and the lower confidence limit remains more or less flat.

Scatter plot with smoother produced by the running command

Jittering

If you have variables with a restricted range of (possibly discrete) values, a scatter plot will convey little information, as there will be many overlapping points. While you might wish to create a bar chart (perhaps with stacked bars) in such a situation, an alternative would be to add a little random noise. This is called jittering.

You can compare two charts here, one without jittering, the other with option jitter(8) added on the command line. The number in parentheses refers to the size of the noise as percentage of the area of the graph.
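
A comparison of this kind can be reproduced with the following sketch, which assumes the bundled auto data (not the data behind the charts shown here). The variable rep78 takes only five values and foreign only two, so most points coincide unless they are jittered. The graph names are arbitrary; they merely keep both graphs open side by side.

```stata
* Assumption: auto data; rep78 (five values) and foreign (two values) overlap heavily
sysuse auto, clear

* Without jittering: at most ten distinct points are visible
scatter rep78 foreign, name(nojitter, replace)

* With jittering: random noise spreads the coinciding points apart
scatter rep78 foreign, jitter(8) name(withjitter, replace)
```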

Scatter plot before applying jitter

Scatter plot after applying jitter

Another way to deal with data points that would overlap in a simple scatterplot is a sunflower plot, but I don't want to encourage that. If you're interested, you just might try sunflower with two variables.

Three (or more) variables

You may plot more than one y variable against the x variable, as in

scatter unempl atypjobs gdp

Here, variables "unempl" and "atypjobs" will be plotted against "gdp". By default, the two y variables will be distinguished by different colours, and a legend will be added to the plot.

When using options, you have to bear in mind that there are several variables. To select marker symbols, you will write, e.g.

scatter unempl atypjobs gdp, msymbol(d Oh)

The first symbol will refer to the data points that depict "unempl", the second to those that represent "atypjobs".
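
The same logic can be tried with data that come with Stata. The sketch below assumes the bundled uslifeexp data, in which le_male and le_female (life expectancy of men and women) are the two y variables and year is the x variable.

```stata
* Assumption: uslifeexp data; le_male and le_female replace unempl and atypjobs
sysuse uslifeexp, clear

* Small diamonds for le_male (first y variable), hollow circles for le_female
scatter le_male le_female year, msymbol(d Oh)
```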


A matrix of scatterplots

graph matrix income tenure educ prestige

will create a matrix in which the four variables mentioned are each plotted against all others, with each variable appearing once on the x axis and once on the y axis. Options are available, e.g. to produce only the lower triangle of the matrix, to "jitter" the variables (add perturbations), and other stuff.

Note particularly the diag[onal] option which changes the labels appearing in the diagonal of the matrix. By default, it's the variable names that appear here, and to change them, write, e.g.:

graph matrix income tenure educ prestige, diag("Income" "Tenure" "Education" "Prestige")
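
Here is a sketch that combines the lower-triangle display with custom diagonal labels. It assumes the bundled auto data; the four variables and their labels are chosen for illustration only.

```stata
* Assumption: auto data; variables and labels are illustrative
sysuse auto, clear

* half draws the lower triangle only; diagonal() replaces the variable names
graph matrix price mpg weight length, half ///
      diagonal("Price" "Mileage" "Weight" "Length")
```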


Line plots

Preamble

A typical use of line plots is to show developments over time. In fact, as lines suggest that something "goes up" or "goes down" (or, of course, "goes up and down", perhaps repeatedly), and such developments occur in time, line plots should be used in other circumstances with great caution.

The development under investigation may concern all kinds of entities, but typically it will refer to units such as "countries", "towns", "schools", "share values", etc.

Organisation of data

A line plot is a twoway graph, which implies that you need two variables. The one that (typically) represents "time" will be represented on the x axis, the other on the y axis. (In fact, a line plot is basically the same as a scatterplot, the only difference being the line that connects the data points.) Thus, a very simple data set (most real data will contain more rows and columns) may look like this:

year  unempl
2008    5.6
2009    7.1
2010    8.3
2011   10.1

Note that time series data constitute a special type of data in Stata. They may be organized in the same way, but you can specifically "set" such data (i.e., give Stata some information about the data, e.g., whether the time series consists of days, months, years or other stuff). For such data, there is a specific command for creating plots, tsline, which unfortunately is not covered here, as it's too far removed from my own research concerns.

Stata commands

A simple line plot for the data I have just displayed will go like this:

twoway line unempl year
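
To try this yourself, you can type in the four rows of the small data set displayed above and then draw the plot. The sort option in the last command is an addition of mine: line connects the points in the order of the observations, so it is a safeguard in case the data are not already ordered by year.

```stata
* Enter the small example data set by hand
clear
input year unempl
2008  5.6
2009  7.1
2010  8.3
2011 10.1
end

* The data are already ordered by year, so the plain command works
twoway line unempl year

* sort orders the points by the x variable before connecting them
twoway line unempl year, sort
```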

There may (!) be circumstances in which you want to designate the data points by symbols (in addition to the lines connecting the data points). In this case, use keyword connected, as in

twoway connected unempl year

Typical options that accompany these commands could be (among others):

lcolor()   colour of the line
lpattern()   the line pattern, obviously
msymbol()   the symbol that represents the data values
msize()   the size of said symbol
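
A sketch that uses all four options at once, assuming the bundled uslifeexp data (le is life expectancy, year the time variable). The particular colour, pattern, symbol and size are arbitrary choices, and the restriction to the years from 1990 onwards merely keeps the symbols from crowding each other.

```stata
* Assumption: uslifeexp data; all option values are arbitrary illustrations
sysuse uslifeexp, clear

* Black dashed line, large hollow circles marking the data values
twoway connected le year if year >= 1990, ///
       lcolor(black) lpattern(dash) msymbol(Oh) msize(large)
```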

Note again that, as with scatter plots, you may also plot more than one y variable on line plots, and you will have to take care of this when using options. For instance,

twoway connected unempl atypjobs year, msymbol(Oh D)

will plot variables "unempl" and "atypjobs" against year, and the former variable will be represented by hollow circles, whereas diamonds will be used for the latter variable.

© W. Ludwig-Mayerhofer, Stata Guide | Last update: 21 Apr 2025