Clustering index documents via vector search?

I'm wondering if there's anything already built into the Cognitive Search vector index to return clusters of similar embeddings, or if that would still be left up to the developer (i.e. us) to run an offline algorithm on the embeddings directly?

I know I can get on-demand clusters, essentially, by running a similarity search from one embedding, but I'm looking to take a set of document embeddings and identify similar clusters so we can auto-generate collections of documents which may be topically related.

Something like this: https://dylancastillo.co/clustering-documents-with-openai-langchain-hdbscan

My assumption is that this would be something for us to handle, but I wanted to make sure this wasn't available already in Cognitive Search.

There is 1 answer

Robert - MSFT

The problem is that the HNSW (Hierarchical Navigable Small World) graph structure is not optimized for clustering, since it focuses on finding nearest neighbors efficiently. As a result, the graph makes no guarantee that all of a node's nearest neighbors lie within a certain distance of it in the graph, which makes navigating the graph to identify clusters suboptimal.

Conversely, clustering usually involves organizing data points into separate, well-defined groups, judged on criteria such as separation, compactness, robustness, and size. If that is what you need, consider running a dedicated clustering algorithm on the vector corpus, such as k-means, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), or agglomerative clustering (AgglomerativeClustering). All of these are available in scikit-learn.
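As a rough illustration of the offline approach, here is a minimal sketch that runs those three scikit-learn algorithms over a matrix of embeddings. The synthetic `embeddings` array stands in for vectors you would export from your index, and the parameter values (`n_clusters=3`, `eps=0.3`, `min_samples=5`) are assumptions chosen for this toy data, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import normalize

# Stand-in for document embeddings exported from the index (assumption):
# three synthetic "topics", 50 documents each, 16 dimensions.
rng = np.random.default_rng(0)
centers = np.eye(3, 16)
embeddings = np.vstack([c + 0.05 * rng.normal(size=(50, 16)) for c in centers])

# Unit-normalize so Euclidean distance tracks cosine similarity.
embeddings = normalize(embeddings)

# k-means and agglomerative clustering need the cluster count up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(embeddings)

# DBSCAN infers the cluster count from density; label -1 marks noise points.
# eps is data-dependent and has to be tuned for real embeddings.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5, metric="cosine").fit_predict(embeddings)

for name, labels in [("k-means", kmeans_labels),
                     ("agglomerative", agglo_labels),
                     ("DBSCAN", dbscan_labels)]:
    print(name, "found", len(set(labels) - {-1}), "clusters")
```

On real document embeddings you usually don't know the number of topics in advance, which is one reason density-based methods such as DBSCAN (or HDBSCAN, as in the linked article) are often tried first.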

You might also be able to run an "ad-hoc clustering" on top of your HNSW index: retrieve a larger set of k approximate nearest neighbors for a given query point, then run a clustering algorithm on that set to see whether the points separate into distinct groups. You can then identify which group your query point belongs to and use that as its cluster. This helps with the case where the query point is not centered in its cluster (whose distribution you don't know yet), so that its "nearest neighbors" include points from other clusters.
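The ad-hoc approach could be sketched roughly as follows. The `top_k_neighbors` helper is a hypothetical stand-in that does an exact cosine search in NumPy; in practice you would replace it with a k-nearest-neighbor vector query against your index. The function name `adhoc_cluster`, the value of `k`, and the `distance_threshold` are likewise assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

def top_k_neighbors(query, corpus, k):
    """Hypothetical stand-in for the index's k-NN query (exact cosine search)."""
    sims = normalize(corpus) @ (query / np.linalg.norm(query))
    return np.argsort(-sims)[:k]

def adhoc_cluster(query, corpus, k=40, distance_threshold=0.8):
    """Return corpus indices of retrieved neighbors that share the query's cluster."""
    idx = top_k_neighbors(query, corpus, k)
    # Row 0 is the query itself; rows are unit-normalized so that
    # Euclidean distance is a monotonic function of cosine similarity.
    points = normalize(np.vstack([query, corpus[idx]]))
    labels = AgglomerativeClustering(
        n_clusters=None,                      # let the threshold decide the count
        distance_threshold=distance_threshold,
        linkage="average",
    ).fit_predict(points)
    return idx[labels[1:] == labels[0]]

# Toy corpus (assumption): 30 documents on one topic, 100 on another.
rng = np.random.default_rng(0)
topic_a = np.eye(1, 16, 0) + 0.05 * rng.normal(size=(30, 16))
topic_b = np.eye(1, 16, 1) + 0.05 * rng.normal(size=(100, 16))
corpus = np.vstack([topic_a, topic_b])
query = np.eye(1, 16, 0)[0] + 0.05 * rng.normal(size=16)

# k=40 deliberately over-fetches, so the neighbor set spills into topic_b.
members = adhoc_cluster(query, corpus, k=40)
print(len(members), "of 40 neighbors share the query's cluster")
```

Over-fetching is the point of the exercise: the clustering step is what trims the neighbor set back to the group the query actually belongs to.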