I was trying to develop a clustering algorithm tasked with finding k classes on a set of 2D points, (with k given as input) using use the Kruskal algorithm lightly modified to find k spanning trees instead of one.
I compared my output to a proposed optimum (1) using the rand index, which for k = 7 resulted on 95.5%. The comparison can be seen on the link below.
Problem:
The set have 5 clearly spaced clusters that are easily classified by the algorithm, but the results are rather disappointing for k > 5, which is when things start to get tricky. I believe that my algorithm is correct, and maybe the data is particularly bad for a Kruskal approach. Single Linkage Agglomerative Clustering, such as Kruskal's, are known to perform badly at some problems since it reduces the assessment of cluster quality to a single similarity between a pair of points.
The idea of the algorithm is very simple:
- Make a complete graph with the data set, with the weight of the edges being the euclidean distance between the pair.
- Sort the edge list by weight.
- For each edge (in order), add it to the spanning forest if it doesn't form a cycle. Stop when all the edges have been traversed or when the remaining forest has k trees.
Bottomline: Why is the algorithm failing like that? Is it Kruskal's fault? If so, why precisely? Any suggestions to improve the results without abandoning Kruskal?
(1): Gionis, A., H. Mannila, and P. Tsaparas, Clustering aggregation. ACM Transactions on Knowledge Discovery from Data(TKDD),2007.1(1):p.1-30.
This is known as single-link effect.
Kruskal seems to be a semi-clever way of computing single-linkage clustering. The naive approach for "hierarchical clustering" is
O(n^3)
, and the Kruskal approach should beO(n^2 log n)
due to having to sort then^2
edges.Note that SLINK can do single-linkage clustering in
O(n^2)
runtime andO(n)
memory.Have you tried loading your data set e.g. into ELKI, and compare your result to single-link clustering.
To get bette results, try other linkages (usually in
O(n^3)
runtime) or density-based clustering such as DBSCAN (inO(n^2)
without index, andO(n log n)
with index). On this toy data set,epsilon=2
andminPts=5
should work good.