Can the results of UMAP for HDBScan clustering be made more consistent?

I have a set of ~40K phrases which I'm clustering with HDBScan after using UMAP for dimensionality reduction. The steps are:

  1. Generate embeddings using a fine-tuned BERT model
  2. Reduce dimensions with UMAP
  3. Cluster with HDBScan
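Steps 2 and 3 above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: it assumes the `umap-learn` and `hdbscan` packages, and the function names `cluster_phrases` and `count_clusters` are made up for illustration. The default parameter values are single picks from the grid listed further down in the question.

```python
def count_clusters(labels):
    """Number of distinct clusters, ignoring HDBSCAN's noise label (-1)."""
    return len({int(label) for label in labels if int(label) != -1})


def cluster_phrases(embeddings, n_components=4, n_neighbors=15,
                    min_dist=0.0, min_cluster_size=50, seed=None):
    """Reduce embeddings with UMAP, then cluster the result with HDBSCAN.

    `embeddings` is assumed to be an (n_phrases, dim) array from step 1.
    `seed` is passed to UMAP as `random_state`; None leaves UMAP unseeded.
    """
    # Imported here so count_clusters stays usable without these packages.
    import hdbscan
    import umap

    reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors,
                        min_dist=min_dist, random_state=seed)
    reduced = reducer.fit_transform(embeddings)

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(reduced)  # one label per phrase, -1 = noise
```

Calling `count_clusters(cluster_phrases(embeddings))` would then give the cluster count for one run.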

I'm finding that sometimes, HDBScan finds 100-200 clusters, which is the desired result. But other times, it finds only 2-4. This is with the same dataset and no change in parameters either for UMAP or HDBScan.

From the UMAP documentation I see that UMAP is a stochastic algorithm, so complete reproducibility should not be expected. But it also says "the variance between runs should ideally be relatively small", which is not the case here. Also, the variance seems to be bimodal -- I either end up with 2-4 clusters or 100+, nothing in between.
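One way to put numbers on this run-to-run variation is to repeat the pipeline over several seeds and record how many clusters each run yields. The sketch below is only an illustration: `run_pipeline` is a hypothetical callable standing in for the UMAP + HDBScan steps, and the threshold of 100 is taken from the "100+" figure above.

```python
def cluster_counts_by_seed(run_pipeline, seeds):
    """Map each seed to the number of non-noise clusters it produced.

    `run_pipeline(seed)` is a hypothetical callable returning one label per
    phrase, with -1 marking noise (the HDBSCAN convention).
    """
    counts = {}
    for seed in seeds:
        labels = run_pipeline(seed)
        counts[seed] = len({int(l) for l in labels if int(l) != -1})
    return counts


def collapsed_seeds(counts, min_expected=100):
    """Seeds whose run produced fewer clusters than expected."""
    return sorted(seed for seed, n in counts.items() if n < min_expected)
```

In a real run, `run_pipeline` would pass the seed to UMAP through its `random_state` argument. That makes an individual run repeatable, and the seeds returned by `collapsed_seeds` can be re-run and inspected to see what the collapsed embedding looks like.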

I've tried different values of parameters for both UMAP (n_components: 3, 4, 6, 10; min_dist: 0.0, 0.1, 0.3, 0.5; n_neighbors: 15, 30) and HDBScan (min_cluster_size: 50, 100, 200) but with all combinations so far, I still occasionally get the undesired 2-4 clusters.

Why is UMAP behaving this way, and how can I ensure it yields the desired 100+ clusters rather than 2-4?

There is 1 answer

Answer by Maciej Skorski

Without data to reproduce the problem (e.g. the tensor of embeddings), we can only make educated guesses.

First, I would suggest plotting a few runs of UMAP on the embedded texts (the output of BERT) using UMAP's own visualization utilities. Note that 40k points is definitely "plottable"; see the linked tutorial.

Second, there may be some non-obvious data issues that cause UMAP to be less stable. One such issue is a large number of duplicates (this happens often when analysing spoken phrases). Another possible cause is a GPU-accelerated implementation, which may be experimental (based on recent research papers and not fully validated).
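To check the duplicates hypothesis, something like the following could be run on the raw phrases before embedding. It is a rough sketch using only the standard library. The helper names are invented here, and the normalization (case-folding and whitespace collapsing) is an assumption about what counts as "the same phrase" in your data.

```python
from collections import Counter


def normalize(phrase):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(phrase.casefold().split())


def duplicate_report(phrases, top=5):
    """Return (fraction of phrases repeating an earlier one, most common repeats)."""
    counts = Counter(normalize(p) for p in phrases)
    n_redundant = sum(c - 1 for c in counts.values())
    repeated = [(p, c) for p, c in counts.most_common(top) if c > 1]
    return n_redundant / max(len(phrases), 1), repeated


def deduplicate(phrases):
    """Return unique phrases plus, for each original phrase, its index in that list."""
    index_of = {}
    unique, inverse = [], []
    for phrase in phrases:
        key = normalize(phrase)
        if key not in index_of:
            index_of[key] = len(unique)
            unique.append(phrase)  # keep the first spelling seen
        inverse.append(index_of[key])
    return unique, inverse
```

If the duplicate fraction turns out to be high, one option is to embed and cluster only `unique`, then map the labels back with `[unique_labels[i] for i in inverse]` and see whether the 2-4 cluster runs still occur.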

I used to run the combination BERT + UMAP + HDBScan on millions of phrases, with results stable despite randomization. I would blame either the data or the specific implementation.