Off-line clustering using Solr?


I want to cluster my indexed data in Solr. Each Solr document contains the following fields: id, title, url.

I have read the Solr 7.7 docs, and the clustering algorithm described there is applied only to the search results of a single query. What I need is clustering of the full index, based on the document title.

Could anyone help?


There are 2 answers

Stanislaw Osinski (best answer)

As far as I'm aware, there's no out-of-the-box plugin for clustering the whole Solr index.

If you have some background in machine learning, have a look at Apache Mahout; it should be suitable for clustering a full index like yours. Alternatively, there's a commercially licensed Carrot2 spin-off we develop called Lingo4G, which is designed for clustering large collections of text. In both cases, however, there is no direct integration with Solr -- you'd need to handle the integration on your own.
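Whichever tool you pick, the first part of that integration is getting every title out of the index. Below is a minimal sketch of one way to do it, using Solr's cursorMark deep paging on the standard /select handler. It assumes that `id` is the uniqueKey field and that the core is reachable over HTTP; the function names and the injectable `fetch` parameter are made up for this illustration and are not part of any Solr client library.

```python
import json
import urllib.parse
import urllib.request


def fetch_page(base_url, params):
    """Issue one /select request and return the decoded JSON response."""
    url = base_url + "/select?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def export_titles(base_url, rows=1000, fetch=fetch_page):
    """Yield (id, title) for every document, using cursorMark deep paging.

    Assumption: `id` is the uniqueKey of the schema. `fetch` is a
    hypothetical hook so the paging loop can be tried without a live Solr.
    """
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        params = {
            "q": "*:*",
            "fl": "id,title",
            "sort": "id asc",  # cursorMark requires a sort that includes the uniqueKey
            "rows": rows,
            "wt": "json",
            "cursorMark": cursor,
        }
        data = fetch(base_url, params)
        for doc in data["response"]["docs"]:
            # title may be missing, or a list if the field is multi-valued
            yield doc["id"], doc.get("title")
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means nothing is left
            break
        cursor = next_cursor
```

You would call it with something like `export_titles("http://localhost:8983/solr/mycollection")` (host and core name are placeholders) and write the pairs to whatever input format the clustering tool expects.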

rscavilla

Results clustering was removed in Solr 8.x. The reason cited on the Solr website was: “The search results clustering contrib (Carrot2) has been removed from 8.x Solr due to lack of Java 1.8 compatibility in the dependency that provides online clustering of search results.”

Here is how I got it to work on JVM 11. All necessary files can be downloaded from this GitHub repo.

  1. Follow the instructions for installing the clustering contrib: https://solr.apache.org/guide/8_1/result-clustering.html
  2. Add solr-clustering-8.7.0.jar to the /solr-8.x.x/dist directory (I tested this jar up to Solr version 8.11.1)
  3. Create the /solr-8.x.x/contrib/clustering directory and copy in the files marked for contrib
  4. Restart Solr

Tested with Java 11.