Carrot2: different clusters for the same query

When I issue the same match-all query (*:*) repeatedly, I get different clusters and scores each time. What could be the reason?

First try:

label: "В Минске"
score: 52.79549568196028

Second try:

label: "В Минске"
score: 54.74385944060893

Third try:

label: "В Минске"
score: 48.884082925408734

The document IDs inside the clusters also differ, and the clusters themselves change: one response contains a cluster "тысячами евро" ("thousands of euros"), in the next response it is gone, and a new cluster appears instead: "Тысячами Долларов" ("Thousands of Dollars").

Is there a Carrot2 parameter that would make the clusters stable for a given query? Could it be desiredClusterCountBase?

The Solr index is the same in all cases. The algorithm used is org.carrot2.clustering.lingo.LingoClusteringAlgorithm, with StopWordLabelFilter.enabled=false and clustering.rows=1000.

Answer by D_K

It looks like I found the reason:

  • The index contained a duplicate of each document. The two copies differed in only one respect: one had a publication date and the other did not.
  • At the same time, my date filter did not work correctly, because the publication dates had been stamped incorrectly on the documents. As a result, the ranking function with reciprocal rank could return different documents in the top 1000 on each request (this part is hard to debug without looking into the Solr source code).
  • The clustering module therefore received a slightly different set of documents each time, so the clusters changed. The most prominent clusters (by size) were still stable, and only their scores changed. Less prominent clusters could be replaced by other less prominent clusters between requests.
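One way to confirm this kind of instability is to compare the document IDs that reach the clustering step across repeated requests. The following is only a minimal sketch: it assumes the ID lists have already been extracted from each response (fetching them from Solr is not shown), and the function names and sample IDs are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of IDs common to both sets; 1.0 means the sets are identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compare_runs(runs: List[Iterable[str]]) -> List[Tuple[float, Set[str], Set[str]]]:
    """Compare each run with the one that follows it.

    Returns one tuple per consecutive pair: (overlap, ids_dropped, ids_added).
    """
    id_sets = [set(run) for run in runs]
    report = []
    for prev, curr in zip(id_sets, id_sets[1:]):
        report.append((jaccard(prev, curr), prev - curr, curr - prev))
    return report


# Hypothetical ID lists standing in for three responses to the same query.
runs = [
    ["doc1", "doc2", "doc3", "doc4"],
    ["doc1", "doc2", "doc3", "doc5"],
    ["doc1", "doc2", "doc3", "doc5"],
]
report = compare_runs(runs)
for overlap, dropped, added in report:
    print(f"overlap={overlap:.2f} dropped={sorted(dropped)} added={sorted(added)}")
```

If the overlap between runs is below 1.0 for an unchanged index, the clustering input itself is changing, which points at ranking or filtering rather than at the clustering algorithm.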

I don't know whether this still counts as a bug, but removing all documents from the index and re-adding them with the correct publication date solved the issue.
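Before re-indexing, it may help to check the source data for the duplicate pattern described above. This is a rough sketch, not the method actually used here: it assumes the documents are available as plain dictionaries, and the field names (id, title, body, pub_date) are hypothetical placeholders for whatever the real schema uses.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence

Doc = Dict[str, Optional[str]]


def find_date_duplicates(
    docs: Sequence[Doc],
    content_fields: Sequence[str] = ("title", "body"),  # assumed field names
    date_field: str = "pub_date",                        # assumed field name
) -> List[List[Doc]]:
    """Return groups of documents with identical content fields where
    at least one copy has a date and at least one copy does not."""
    groups = defaultdict(list)
    for doc in docs:
        key = tuple(doc.get(field) for field in content_fields)
        groups[key].append(doc)

    suspicious = []
    for group in groups.values():
        dated = [d for d in group if d.get(date_field)]
        undated = [d for d in group if not d.get(date_field)]
        if dated and undated:
            suspicious.append(group)
    return suspicious


# Hypothetical sample: docs "1" and "2" are the same article, one without a date.
docs = [
    {"id": "1", "title": "A", "body": "text a", "pub_date": "2015-03-01"},
    {"id": "2", "title": "A", "body": "text a", "pub_date": None},
    {"id": "3", "title": "B", "body": "text b", "pub_date": "2015-03-02"},
]
duplicates = find_date_duplicates(docs)
for group in duplicates:
    print([d["id"] for d in group])
```

Any group reported this way is a candidate for the pairs that made the top 1000 unstable.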