Exploratory Data Analysis
Stem-and-leaf displays
Stem-and-leaf displays are a good way of looking at the shape of your data. With Stata, this holds only if you have a small data set (say, a few hundred cases at most). With large numbers of cases, you will run into trouble: Stata always displays each single case as one leaf and may therefore suppress a considerable number of individual values, indicating the number of values suppressed in parentheses. (I wouldn't be surprised if there were some user-written routine to deal with this problem; so far I have not bothered to search for one.)
The basic command is:
stem income
To compare groups, you may use by or if.
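A minimal sketch of both variants, under stated assumptions: the small data set is invented purely for illustration, and the variable education with codes 1 and 2 is hypothetical (only income appears in the basic command above).

```stata
* Toy data, invented for illustration only
clear
input income education
1200 1
1500 1
1800 1
2100 1
2400 2
2900 2
3300 2
4100 2
end

* One stem-and-leaf display per group (bysort sorts the data first)
bysort education: stem income

* Display for a single group only
stem income if education == 1
```

The by variant prints all groups in one go, whereas if is handy when you want to look at one particular group.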
Apart from the "large n" problem mentioned above, Stata usually produces a meaningful display. However, if for some reason you don't like the default outcome, you can influence the shape of the display with a few options. Consider the following example:
0* | 679
1* | 339
2* | 449
3* | 01269
4* | 44567
5* | 177789
6* | 246
This display has one line per leading digit (the digit forming the stem); put otherwise, it has a width of 10 (10 different digits may be displayed per stem). By using either stem percfemale, lines(2) or stem percfemale, width(5), you split each stem into two lines and get the following display:
0. | 679
1* | 33
1. | 9
2* | 44
2. | 9
3* | 012
3. | 69
4* | 44
4. | 567
5* | 1
5. | 77789
6* | 24
6. | 6
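To try this yourself, the following sketch rebuilds the variable from the leaves shown above, assuming the values are whole numbers (the display does not tell us whether they were rounded). The final pair of commands extends the same logic to five lines per stem. That these two should again be equivalent is my reading of how lines() and width() relate, so check help stem before relying on it.

```stata
* Values read off the display above, assumed to be integers
clear
input percfemale
6
7
9
13
13
19
24
24
29
30
31
32
36
39
44
44
45
46
47
51
57
57
57
58
59
62
64
66
end

* Default display
stem percfemale

* Two lines per stem, requested in two ways
stem percfemale, lines(2)
stem percfemale, width(5)

* Five lines per stem; assumed to be equivalent to each other as well
stem percfemale, lines(5)
stem percfemale, width(2)
```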
Maybe you want to try stemplot, created by Nicholas J. Cox, as an alternative display. To install it, type search stemplot, scroll down to "Search of web resources" and click on the blue link provided. The basic command is just stemplot varname; Stata's help will show you some more.
Box plots
Box plots (or, more formally, box-and-whisker plots) are a convenient way of displaying basic features of the distribution of a variable: they give a rough view of the distribution of the bulk of the data and also display outlying values. They can be used to check differences between groups; however, subtle differences may not be detected.
The basic command to create box plots for a metric variable, given a grouping variable (in our example, education), is as follows (the grouping variable may be omitted, of course, to produce a univariate plot):
graph box income, over(education)
Here is a more elaborate version of this command that uses a number of options to refine the graph:
graph box income, over(education) scheme(s1mono) plotregion(lstyle(none)) aspectratio(1.2) outergap(*3) intensity(0) medtype(cline) medline(lwidth(medium) lcolor(gs0)) title("Household equivalent income by education") ytitle("Income") note("Source: XY data, own calculations")
Using graph hbox, the plots will be displayed horizontally.
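The three basic variants (univariate, grouped, horizontal) can be tried with a sketch like the following. The data are invented for illustration only, and the coding of education as 1 and 2 is a hypothetical choice; with your own data, only the graph commands matter.

```stata
* Toy data, invented for illustration only
clear
input income education
1200 1
1500 1
1800 1
2100 1
5200 1
2400 2
2900 2
3300 2
4100 2
9800 2
end

* Univariate box plot (grouping variable omitted)
graph box income

* One box per education group
graph box income, over(education)

* The same plot with horizontal boxes
graph hbox income, over(education)
```

Each graph command replaces the previous graph in the Graph window, so run them one at a time if you want to compare the results.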
© W. Ludwig-Mayerhofer, Stata Guide | Last update: 20 Apr 2025