Winsorizing and Trimming

It is well known that the mean (usually the arithmetic mean) can be heavily influenced by outlying values. Trimming and winsorizing are procedures that help to assess the magnitude of such influences and to arrive at measures that are less sensitive to them.

Trimming means discarding values at the tails of the distribution. That is, a percentage of the lowest and (normally an equal percentage of) the highest values of a variable are removed from the data when computing the mean. For instance, you may remove 5 per cent of the lowest and 5 per cent of the highest values.

Winsorizing works differently: the values at the tails of the distribution are not removed but recoded to less extreme values. To continue the earlier example, the lowest 5 per cent of the values would be recoded to the value of the 5th percentile and the highest 5 per cent to the value of the 95th percentile.

Neither technique is part of Stata's standard distribution. In fact, the computation of percentiles allows every user to do their own trimming or winsorizing, but of course it is nice to have some ready-made procedures, i.e. ado files. We have to be grateful to the tireless Nicholas Cox, who wrote most of the pertinent packages.
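To illustrate the do-it-yourself route, here is a minimal sketch of 5 per cent winsorizing that uses built-in commands only. The example dataset auto.dta (shipped with Stata) and the names price_w5, cut_lo and cut_hi are my own choices for illustration; the ado files described below may treat ties and cut-off values somewhat differently.

```stata
* Manual winsorizing at the 5th and 95th percentile (illustrative sketch).
* Assumption: auto.dta stands in for your own data; price_w5 is a made-up name.
sysuse auto, clear

* _pctile leaves the requested percentiles in r(r1), r(r2), ...
_pctile price, percentiles(5 95)
scalar cut_lo = r(r1)
scalar cut_hi = r(r2)

* Copy the variable, then pull both tails in to the cut-off values.
* The !missing() guard matters because Stata treats missing as larger
* than any number.
generate double price_w5 = price
replace price_w5 = cut_lo if price < cut_lo
replace price_w5 = cut_hi if price > cut_hi & !missing(price)

* Compare the original and the winsorized variable.
summarize price price_w5
```

The number of cases is unchanged; only the minimum and the maximum (and, as a consequence, the mean and the standard deviation) differ.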

Note that only winsorizing actually works like a data transformation procedure: it changes the values of a variable (by default creating a new variable which is added to the dataset), on which we may work thereafter. Trimming, as implemented in some of the packages presented here, does not change the dataset; it computes means while discarding values at the tails of the distribution and therefore works more like a data analysis procedure. However, because the procedures are so similar, I present both in this section.

Trimming

The packages I am going to describe are called trimmean and trimplot. They can be downloaded via

ssc install trimmean
ssc install trimplot

Neither procedure changes or creates any data; they just compute means under different conditions of trimming and display these in a table or a plot. Note that the winsor2 procedure described below can create trimmed variables that are added to the dataset.

trimmean

This procedure basically works like this: You inform Stata about percentages or (absolute) numbers of cases to be removed, and Stata reports the means computed based on the trimmed values. You may indicate single values, several values (value lists) or starting and ending points with an increment. Thus,

trimmean income, percent(0(5)50)

will remove 0, 5, 10, ..., 50 per cent of the cases on each tail of the distribution and show the means computed on each of the trimmed samples. Note that removing 50 per cent on each tail will not be done literally; rather, the value 'in the middle', i.e. the median, will be retained. Likewise,

trimmean income, number(100 200 300 500)

will successively remove 100, 200, 300 and finally 500 cases on each tail of the distribution and compute the means.
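The idea behind trimming a fixed number of cases can be sketched by hand with built-in commands. The following lines assume the example dataset auto.dta and a hypothetical trimming depth of 5 cases per tail; they are meant to illustrate the principle, not to reproduce the exact computation performed by trimmean.

```stata
* Mean after removing k cases on each tail (illustrative sketch).
* Assumption: auto.dta and k = 5 are arbitrary choices for demonstration.
sysuse auto, clear
local k = 5

* Sorting puts missing values last, so observations 1 to n are the
* non-missing ones in ascending order.
sort price
quietly count if !missing(price)
local n = r(N)

* Keep only the observations between the two trimmed tails.
local first = `k' + 1
local last  = `n' - `k'
summarize price in `first'/`last'
display "Mean after trimming `k' cases per tail: " r(mean)
```

Note that the data are left untouched apart from the sort order; the trimming applies only to the summarize command.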

The following table was produced with the help of the command shown above with the percent option. The variable investigated is very skewed; more than 50 per cent of the values are exactly 1, the 75th percentile is 3, the 90th percentile is 13, and the maximum is almost 400. Therefore, the untrimmed mean is much higher than any trimmed mean.

  +-------------------------------+
  | percent      #   trimmed mean |
  |-------------------------------|
  |       0   5062       6.971355 |
  |       5   4556       3.059482 |
  |      10   4050       2.101235 |
  |      15   3544        1.62754 |
  |      20   3038       1.344306 |
  |      25   2532       1.153239 |
  |      30   2026       1.016288 |
  |      35   1520              1 |
  |      40   1014              1 |
  |      45    508              1 |
  |      50      2              1 |
  +-------------------------------+

Further options are available; among them, ci adds standard errors and confidence intervals to the means.

trimplot

This procedure successively eliminates cases at both tails and plots the resulting means (y axis) against the respective number of cases removed, called 'depth' in the graph (x axis). The simplest version is

trimplot income

Options include by(), which plots the means for subgroups defined by the variable indicated within the parentheses, and p, which asks Stata to display the percentage of removed cases on the x axis instead of the absolute number of cases.

Winsorizing

In contrast to the trimming procedures described above, winsorizing transforms your current working dataset by creating new ("winsorized") variables that can be used for further analysis.

The winsor ado file was written by Nicholas J. Cox; Yujun Lian seemingly used the code and expanded the file to create winsor2 (see https://www.statalist.org/forums/forum/general-stata-discussion/general/1430830-winsor1-vs-winsor-2). The syntax of the two ados differs slightly, and winsor2 can do some things winsor cannot (and in part does not want to) do. In particular, winsor2 allows you to replace an existing variable by its winsorized version, and it also allows you to winsorize different percentages of cases at the two ends of the distribution. Furthermore, this procedure can be used to trim a variable.

Both ado files can be installed from ssc:

ssc install winsor
ssc install winsor2

winsor

This procedure requires two options. The first informs Stata about the number or the fraction of cases to be modified in each tail; this translates into h() followed by a number that is at least 1 and not larger than half of the cases, or p() followed by a fraction larger than 0 and smaller than .5. The second, gen(), indicates the name of a variable that does not yet exist and to which the winsorized values will be written.

Thus,

winsor income, p(.1) gen(inc_w10)

will recode the bottom and the top 10 per cent of the cases in variable 'income' to the values corresponding to the 10th and the 90th percentile, respectively, and write the result to variable inc_w10.

winsor income, h(100) gen(inc_h100)

will recode the 100 lowest and the 100 highest cases to the next value counting inwards from each tail (i.e. the 101st lowest and the 101st highest value, respectively), and write the result to variable inc_h100.

winsor2

This procedure may be invoked without any options; in this case, 1 per cent at each tail of the distribution will be winsorized and the result will be written to a new variable whose name is derived from the original variable name by appending "_w". More flexibility can be achieved by using options, as in:

winsor2 income, cuts(5 80) suffix(_new)

Here, 5 per cent of the cases at the bottom and 20 per cent at the top of the distribution will be winsorized; the name of the new variable is created by using the original name and appending "_new". As you can see, you are not required to winsorize an equal number of cases at each tail.

Finally,

winsor2 income, trim cuts(5 80) suffix(_tr)

will trim variable income (at the same percentiles as before), i.e. set the values beyond these percentiles to missing, and write the result to variable "income_tr".
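To make the difference between winsorizing and trimming at unequal cut-offs tangible, here is a hand-made sketch using built-in commands only. It assumes the example dataset auto.dta and uses the 5th and 80th percentiles as in the examples above; the variable and scalar names are made up, and I do not claim that winsor2 handles cut-off values and ties in exactly the same way.

```stata
* Winsorizing versus trimming at the 5th and 80th percentile (sketch).
* Assumption: auto.dta stands in for your own data; all new names are made up.
sysuse auto, clear
_pctile price, percentiles(5 80)
scalar cut_lo = r(r1)
scalar cut_hi = r(r2)

* Winsorized copy: values beyond the cut-offs are recoded to the cut-offs.
generate double price_new = price
replace price_new = cut_lo if price < cut_lo
replace price_new = cut_hi if price > cut_hi & !missing(price)

* Trimmed copy: values beyond the cut-offs are set to missing.
generate double price_tr = price
replace price_tr = . if price < cut_lo | price > cut_hi

* The winsorized variable keeps all cases, the trimmed one loses some.
summarize price price_new price_tr
```

The output of summarize shows the full number of observations for the winsorized variable and a smaller number for the trimmed one, whereas the minimum and the maximum of the two new variables stay within the same cut-offs.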