Cluster Analysis

I'm afraid I cannot really recommend Stata's cluster analysis module. The output is simply too sparse. Perhaps there are some ados available of which I'm not aware. Anyway, if you have to do it, here you'll see how.

Hierarchical cluster analysis

cluster ward var17 var18 var20 var24 var25 var30
cluster gen gp = gr(3/10)
cluster tree, cutnumber(10) showcount

The first command carries out the cluster analysis proper and stores its results with the data set. The second command derives the groupings of cases (the cluster solutions) from those results. Finally, the third command produces a tree diagram or dendrogram, cut off at 10 clusters.
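
If you do not have variables like var17 to var30 at hand, the same three steps can be tried on the auto data set that ships with Stata. This is only a sketch: the choice of variables, the standardisation via egen, and the analysis name wardauto are assumptions made for illustration and are not part of the example above.

```stata
* Load the example data set shipped with Stata
sysuse auto, clear

* Standardise the variables so that no single scale dominates the distances
foreach v of varlist price mpg weight length {
    egen z_`v' = std(`v')
}

* Step 1: Ward's linkage on the standardised variables
cluster wardslinkage z_price z_mpg z_weight z_length, name(wardauto)

* Step 2: store the solutions with 3 to 10 clusters as gp3 ... gp10
cluster generate gp = groups(3/10), name(wardauto)

* Step 3: dendrogram cut off at 10 branches, with the number of cases per branch
cluster dendrogram wardauto, cutnumber(10) showcount
```

cluster tree, as used above, is a synonym for cluster dendrogram; the unabbreviated command names are spelled out here only for readability.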

Now, a few words about the first two command lines.

In cluster ward var17 ... the interesting thing is ward, which requests a cluster analysis according to the Ward method (minimizing within-cluster variation). Other methods are available; the keywords are largely self-explanatory for those who know cluster analysis:

  • s or singlelinkage
  • a or averagelinkage
  • wav or waveragelinkage
  • med or medianlinkage
  • centr or centroidlinkage

waveragelinkage stands for weighted average linkage.

What about dissimilarity measures? There is a default measure for each of the methods; in the case of the Ward method, it's the squared Euclidean distance. But many other measures are available, which can be requested via the option measure(keyword). See the Stata help for details about the available keywords.
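
As a brief illustration of how the method keyword and the measure() option combine, consider the following sketch. It reuses the variable names from the example above; the analysis names avgl1 and wardl2 are made up for illustration, and which measure is appropriate depends on your data.

```stata
* Average linkage with the absolute-value (city-block) distance
cluster averagelinkage var17 var18 var20 var24 var25 var30, measure(L1) name(avgl1)

* Ward's method with its default measure spelled out explicitly
cluster ward var17 var18 var20 var24 var25 var30, measure(L2squared) name(wardl2)
```

Each of these commands stores a separate cluster analysis with the data set, so you can derive groupings from either of them later on.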

Now, the second command turns the stored results into actual cluster solutions, that is, groupings of cases. If you have just accomplished the first step, the second command will build immediately on it. The command presented here computes the cluster solutions for 3 to 10 clusters and stores the grouping of cases for each solution. gp means that the groupings will be stored in variables whose names start with the characters "gp". That is, afterwards you will find the variables "gp3", "gp4" and so on, up to "gp10", in your data set.
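
Once the grouping variables exist, ordinary Stata commands can be used to inspect them. The following lines are a minimal sketch, assuming that gp3 and gp4 were created as described above. The cluster stop line goes slightly beyond the example: it requests the Calinski-Harabasz pseudo-F index as one possible aid for choosing among the solutions.

```stata
* Number of cases in each cluster of the three-cluster solution
tabulate gp3

* How the four-cluster solution relates to the three-cluster solution
tabulate gp4 gp3

* Means of the clustering variables by cluster
tabstat var17 var18 var20 var24 var25 var30, by(gp3) statistics(mean)

* Calinski-Harabasz pseudo-F for the solutions with 3 to 10 clusters
cluster stop, rule(calinski) groups(3/10)
```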

You can refer to cluster computations (first step) that were accomplished earlier. For instance, if you are using the cluster command the way I have done here, Stata will store some values in variables whose names start with "_clus_1" if it's the first cluster analysis on this data set, with "_clus_2" if it's the second, and so on for each additional computation. If you want to refer to one of these at a later stage (for instance, after having done some other cluster computations), you can do so via the "name" option:

cluster gen gp = gr(3/10), name(_clus_1)

Of course, this presupposes that the variables that start with "_clus_1" are still present, which means that either you have not finished your session or you have saved the data set containing these variables.
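
Rather than relying on the default names, you may find it easier to name an analysis yourself and to check which analyses are stored with the data set. The following is a sketch only: the analysis name ward1, the stub wgp and the file name clusterdata are hypothetical choices.

```stata
* Give the analysis a name of your own instead of the default _clus_1
cluster ward var17 var18 var20 var24 var25 var30, name(ward1)

* List the cluster analyses currently stored with the data set
cluster dir

* Show the details of one particular analysis
cluster list ward1

* Refer to the named analysis when generating the grouping variables
cluster gen wgp = gr(3/10), name(ward1)

* Save the data set so that the analysis is still available in later sessions
save clusterdata, replace
```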

K-means clustering

K-means clustering means that you start from pre-defined clusters. "Pre-defining" can happen in a number of ways. I give only one example, in which you have already done a hierarchical cluster analysis (or have some other grouping variable) and wish to use K-means clustering to "refine" its results (which I personally recommend).

cluster k var17 var18 var20 var24 var25 var30, k(7) name(gp7k) start(group(gp7))

cluster k (short for cluster kmeans) is the keyword for k-means clustering. Next, the variables to be used are listed. The options work as follows: k(7) means that we are dealing with seven clusters. name(gp7k) gives the analysis a name; as no separate generate() option is specified, the resulting allocation of cases to clusters will be stored in a variable of the same name, "gp7k". start(group(gp7)) means that the analysis will start from the grouping of cases accomplished before, stored in variable "gp7".
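
To see how much the K-means step has changed the hierarchical solution, you can cross-tabulate the two grouping variables. The sketch below also shows one alternative way of choosing the starting clusters, namely picking seven observations at random as starting centers. The analysis name gp7rand and the seed 12345 are arbitrary choices made for illustration.

```stata
* Compare the hierarchical solution with the refined k-means solution
tabulate gp7 gp7k

* Alternative start: 7 randomly chosen observations, with a seed for reproducibility
cluster kmeans var17 var18 var20 var24 var25 var30, k(7) name(gp7rand) start(krandom(12345))
```

Cases off the main pattern of the cross-tabulation are those that K-means has moved to a different cluster.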

© W. Ludwig-Mayerhofer, Stata Guide | Last update: 21 Feb 2009