Exploratory Data Analysis

Stem-and-leaf displays

Stem-and-leaf displays are a good way of looking at the shape of your data. In Stata, this works well only with a small data set (say, a few hundred cases at most). With large numbers of cases you will run into trouble: Stata always displays each single case as one leaf, so it may have to suppress a considerable number of individual values, indicating the number of suppressed values in parentheses. (I wouldn't be surprised if there were a user-written routine to deal with this problem; so far I have not searched for one.)

The basic command is:

stem income

To compare groups, you may use by or if.
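
A minimal sketch of both approaches, using the auto data set shipped with Stata (an assumption made here so that the lines run as they are; replace mpg and foreign with your own variables, e.g. income and education):

```stata
* Load one of Stata's example data sets (assumption: not the data used in this guide)
sysuse auto, clear

* One display per group: bysort sorts by foreign, then runs stem within each group
bysort foreign: stem mpg

* A display for a single group only, selected with if
stem mpg if foreign == 1
```

With by, the displays are printed one after the other, so the groups still have to be compared by eye.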

Apart from the "large n" problem mentioned above, Stata usually produces a meaningful display. If you don't like the default outcome, however, a few options let you influence the shape of the display. Consider the following example:

0* | 679
1* | 339
2* | 449
3* | 01269
4* | 44567
5* | 177789
6* | 246

This display has one line per leading digit (the digit forming the stem); put otherwise, each stem has a width of 10 (10 different digits may be displayed per stem). Using either stem percfemale, lines(2) or stem percfemale, width(5), you will get two lines per stem instead:

0. | 679
1* | 33
1. | 9
2* | 44
2. | 9
3* | 012
3. | 69
4* | 44
4. | 567
5* | 1
5. | 77789
6* | 24
6. | 6
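
To try these options on your own screen, the following sketch assumes the auto data set shipped with Stata rather than the data shown above. For a variable such as mpg, whose stems are the tens digits, lines(2) and width(5) should describe the same display:

```stata
* Stata's example data (assumption: stands in for the data used above)
sysuse auto, clear

* One line per stem, i.e. a stem width of 10
stem mpg, lines(1)

* Two lines per stem, requested in two equivalent ways
stem mpg, lines(2)
stem mpg, width(5)
```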

As an alternative display, you may want to try stemplot, written by Nicholas J. Cox. To install it, type search stemplot, scroll down to "Search of web resources" and click on the blue link provided. The basic command is just stemplot varname; Stata's help will show you more options.


Box plots

Box plots (or, more formally, box-and-whisker plots) are a convenient way of displaying basic features of a variable's distribution: they give a rough view of the bulk of the data and also display outlying values. They can be used to check for differences between groups; subtle differences, however, may go undetected.

The basic command to create box plots for a metric variable by groups (in our example, the grouping variable is education) is as follows; the over() option may be omitted, of course, to produce a univariate plot:

graph box income, over(education)
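
For a version that runs without your own data, here is a minimal sketch using the auto data set shipped with Stata (an assumption; mpg and foreign merely stand in for income and education):

```stata
* Stata's example data (assumption: not the data used in this guide)
sysuse auto, clear

* Univariate box plot: no grouping variable
graph box mpg

* One box per category of the grouping variable
graph box mpg, over(foreign)
```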

Here is a more elaborate version of this command that uses a number of options to create the following graph:

graph box income, over(education) scheme(s1mono) plotregion(lstyle(none)) aspectratio(1.2) outergap(*3) intensity(0) medtype(cline) medline(lwidth(medium) lcolor(gs0)) title("Household equivalent income by education") ytitle("Income") note("Source: XY data, own calculations")

[Figure: Bivariate box plot]
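
The long command is easier to read, and to modify, when it is spread over several lines in a do-file. The following layout is only a sketch of how that might look; the comments are my reading of what each option does, and the command assumes that a data set containing income and education is in memory:

```stata
* Assumes a data set with the variables income and education is in memory.
* Run this from a do-file: /// joins a line with the next one,
* and anything after /// on the same line is treated as a comment.
graph box income, over(education)                /// one box per education group
    scheme(s1mono)                               /// monochrome scheme
    plotregion(lstyle(none))                     /// no outline around the plot region
    aspectratio(1.2)                             /// plot region taller than wide
    outergap(*3)                                 /// three times the default gap at both edges
    intensity(0)                                 /// box fill at zero intensity
    medtype(cline)                               /// median drawn as a customizable line
    medline(lwidth(medium) lcolor(gs0))          /// medium width, black
    title("Household equivalent income by education") ///
    ytitle("Income")                             ///
    note("Source: XY data, own calculations")
```

The /// continuation works in do-files only; when typing in the Command window, enter the command on a single line as shown above.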

If you use graph hbox instead of graph box, the plots will be displayed horizontally.
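
A minimal sketch, reusing the basic command from above with graph hbox in place of graph box (income and education are assumed to be in memory):

```stata
* Same variables as in the basic box plot command (income, education)
graph hbox income, over(education)
```

Options such as over() are used in the same way. Note that ytitle() still refers to the income axis, which is now drawn horizontally.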

© W. Ludwig-Mayerhofer, Stata Guide | Last update: 20 Apr 2025