How do I choose a linkage method for Hierarchical Agglomerative Clustering?

I understand that HAC has several options in terms of linkage functions. You have:

  • Single linkage, which produces "straggly" clusters
  • Complete linkage, which produces tight, spherical clusters
  • Average linkage, which is sort of a compromise between the two
  • Ward's method, which is based more on variance than on actual distance

What I'm trying to figure out is, how do I know which one of these I want to use? Are there certain datasets where "straggly" clusters are preferable to spherical ones? Or is it more a function of what I intend to do with the clustering data?

1 answer

Answered by Has QUIT--Anony-Mousse

It depends on your data.

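Single-linkage works reasonably well on clean data.

If you have dirty data, the other linkages may be better. Single-linkage merges two clusters as soon as any one pair of their points is close, so a few noise points lying between clusters can chain them together.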
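To make the "dirty data" point concrete, here is a minimal sketch in plain Python. It is a naive textbook-style agglomeration loop, not any library's implementation, and the toy 1-D data (two tight groups, three "bridge" noise points, one outlier) as well as the names `agglomerate` and `LINKAGES` are made up for illustration.

```python
from itertools import combinations

# Toy 1-D data (all values are hypothetical, chosen for illustration):
# two tight groups, three "bridge" noise points between them, one far outlier.
GROUP_A = [0.0, 0.1, 0.3]
BRIDGE = [2.6, 5.0, 7.5]
GROUP_B = [10.0, 10.15, 10.4]
OUTLIER = [14.0]
POINTS = GROUP_A + BRIDGE + GROUP_B + OUTLIER


def pair_distances(c1, c2):
    """All point-to-point distances between two clusters."""
    return [abs(x - y) for x in c1 for y in c2]


# Each linkage turns the point-to-point distances into one cluster distance.
LINKAGES = {
    "single": lambda c1, c2: min(pair_distances(c1, c2)),
    "complete": lambda c1, c2: max(pair_distances(c1, c2)),
    "average": lambda c1, c2: sum(pair_distances(c1, c2)) / (len(c1) * len(c2)),
}


def agglomerate(points, linkage, k):
    """Naive agglomeration: merge the closest pair of clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of cluster indices (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


for name, linkage in LINKAGES.items():
    print(name, agglomerate(POINTS, linkage, k=2))
```

On this particular toy input, single linkage chains through the bridge points and spends its second cluster on the lone outlier, while complete and average linkage keep the two tight groups in separate clusters. Real data will not be this clear-cut, so treat it as an illustration of the failure mode rather than a benchmark.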

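Ward is similar to k-means: at each step it merges the two clusters whose merger increases the within-cluster sum of squares the least, and that sum of squares is the quantity k-means tries to minimize. It may be a good choice if you want to talk about centroids and data partitioned completely into disjoint subsets.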
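The link between Ward and k-means can be checked numerically. The sketch below uses 1-D points for brevity and hypothetical helper names (`sse`, `ward_cost`); it compares the increase in the sum of squared errors (SSE) caused by a merge with the usual closed-form Ward cost, |A||B| / (|A| + |B|) times the squared distance between the centroids.

```python
def mean(xs):
    return sum(xs) / len(xs)


def sse(xs):
    """Sum of squared distances to the centroid (the k-means objective for one cluster)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


def ward_cost(a, b):
    """Closed-form increase in total SSE when clusters a and b are merged."""
    return len(a) * len(b) / (len(a) + len(b)) * (mean(a) - mean(b)) ** 2


# Two made-up clusters, for illustration only.
a = [0.0, 1.0, 2.0]
b = [6.0, 8.0]

direct = sse(a + b) - sse(a) - sse(b)  # SSE after the merge minus SSE before
print(direct, ward_cost(a, b))         # the two values agree up to rounding
```

So Ward greedily picks the merge that hurts the k-means objective least, which is why its clusters tend to look like k-means clusters.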

The other problem is that only SLINK (for single-linkage) is fast, at O(n^2). All the others usually work in O(n^3) in their standard implementations, so they are not usable on large data sets. Compare this to e.g. DBSCAN, which runs in O(n log n) if done well (with an index), or k-means in O(n) per iteration...