Can the results of UMAP for HDBScan clustering be made more consistent?


I have a set of ~40K phrases that I'm clustering with HDBScan after using UMAP for dimensionality reduction. The steps, sketched in code below, are:

  1. Generate embeddings using a fine-tuned BERT model
  2. Reduce dimensions with UMAP
  3. Cluster with HDBScan
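
For context, here is a stripped-down sketch of the pipeline (the phrase loader and the embedding helper are hypothetical placeholders, and the parameter values shown are just one of the combinations I've tried):

```python
# Minimal sketch of the pipeline above; load_phrases() and embed_with_bert()
# are hypothetical placeholders for my data loading and fine-tuned BERT encoder.
import umap
import hdbscan

phrases = load_phrases()               # ~40K phrases
embeddings = embed_with_bert(phrases)  # step 1: BERT embeddings, shape (n, d)

# Step 2: UMAP reduction; random_state is left unset, so every run differs
reducer = umap.UMAP(n_components=10, n_neighbors=15, min_dist=0.0)
reduced = reducer.fit_transform(embeddings)

# Step 3: HDBScan clustering on the reduced vectors (label -1 means noise)
clusterer = hdbscan.HDBSCAN(min_cluster_size=100)
labels = clusterer.fit_predict(reduced)
```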

I'm finding that HDBScan sometimes finds 100-200 clusters, which is the desired result, but other times it finds only 2-4. This happens with the same dataset and no change in parameters for either UMAP or HDBScan.

From the UMAP documentation I see that UMAP is a stochastic algorithm, so complete reproducibility should not be expected. But it also says "the variance between runs should ideally be relatively small", which is not the case here. Moreover, the outcome seems to be bimodal -- I either end up with 2-4 clusters or 100+, never anything in between.

I've tried different parameter values for both UMAP (n_components: 3, 4, 6, 10; min_dist: 0.0, 0.1, 0.3, 0.5; n_neighbors: 15, 30) and HDBScan (min_cluster_size: 50, 100, 200), but with every combination so far I still occasionally get the undesired 2-4 clusters.
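
For reference, the sweep looks roughly like this (a simplification of what I'm actually running; `embeddings` is the array from the sketch above):

```python
# Rough sketch of the parameter sweep described above.
from itertools import product
import umap
import hdbscan

grid = product([3, 4, 6, 10],         # n_components
               [0.0, 0.1, 0.3, 0.5],  # min_dist
               [15, 30],              # n_neighbors
               [50, 100, 200])        # min_cluster_size

for n_components, min_dist, n_neighbors, min_cluster_size in grid:
    reduced = umap.UMAP(n_components=n_components,
                        min_dist=min_dist,
                        n_neighbors=n_neighbors).fit_transform(embeddings)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 = noise
    print(n_components, min_dist, n_neighbors, min_cluster_size, "->", n_clusters)
```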

Why is UMAP behaving this way, and how can I make the pipeline consistently yield the desired 100+ clusters rather than 2-4?

1 Answer

Answered by Maciej Skorski

Unless we see some data to reproduce the issue (e.g. the tensor of embeddings), we can only give educated guesses.

First, I would suggest plotting a few runs of UMAP with its visualization utilities on the embedded texts (the output of BERT). Note that 40k points is definitely plottable; see the linked tutorial.
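
As a rough sketch of what I mean (assuming `embeddings` holds the BERT outputs and the optional plotting extras of umap-learn are installed, e.g. `pip install umap-learn[plot]`):

```python
# Sketch: fit UMAP a few times and plot each run side by side to eyeball
# how much the structure changes between runs.
import umap
import umap.plot
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for i, ax in enumerate(axes):
    mapper = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.0).fit(embeddings)
    umap.plot.points(mapper, ax=ax)
    ax.set_title(f"UMAP run {i + 1}")
plt.show()
```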

Second, there may be some non-obvious data issue that makes UMAP less stable. One such issue is having many duplicates (which happens often when analysing spoken phrases). Another possible cause is a GPU-accelerated implementation that may be experimental (based on recent research papers and not fully validated).
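
A quick sanity check along those lines could look like this (just a sketch, with the variable names `phrases` and `embeddings` taken from the question's setup):

```python
# Sketch: look for heavy duplication in both the raw phrases and the embeddings.
from collections import Counter
import numpy as np

print(Counter(phrases).most_common(10))  # phrases repeated thousands of times are a red flag

emb = np.asarray(embeddings)
unique_rows = np.unique(emb, axis=0)
print(f"{len(emb)} embeddings, {len(unique_rows)} unique")
```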

I used to run the BERT + UMAP + HDBScan combination on millions of phrases, with results that were stable despite the randomization. I would therefore blame either the data or the specific implementation.