Rank by maximum bm25 score on a field of type array<string>

I have a schema that has a field of type array<string>:

field titles type array<string> { 
        indexing: index | summary | attribute
        index: enable-bm25
        attribute: fast-search
}

Say titles contains N titles: Title 1, Title 2, ..., Title N. I would like to rank documents by the maximum bm25 score between the query and any single title in titles. In other words, I would like the document's rank score to equal max(bm25('Title 1'), bm25('Title 2'), ..., bm25('Title N')).

Just setting the ranking expression to bm25(titles) does not achieve what I want. For example, given a query Q with the terms term 1, term 2, term 3 and two documents:

  • doc 1: {"titles": [".*term 1.*", ".*term 1.*", ".*term 1.*", ".*term 1.*"]}
  • doc 2: {"titles": [".*term 1 term 2 term 3.*", STRING_WITH_NONE_OF_THE_TERMS, STRING_WITH_NONE_OF_THE_TERMS, STRING_WITH_NONE_OF_THE_TERMS, STRING_WITH_NONE_OF_THE_TERMS]}

With the bm25(titles) ranking expression, doc 1 ranks higher than doc 2. I assume that is because a query term appears in every title of doc 1, while in doc 2 the query terms appear in only one title. I want doc 2 to rank higher because it contains a title that is an almost complete match for the query, so max(bm25) should be higher for doc 2, even though the average or sum over all titles might be higher for doc 1.

Is there a way I can achieve that in Vespa?

1 Answer

Answer by Jo Kristian Bergum (best answer)

Thanks for the detailed question. Vespa does not support this for the bm25 rank feature: for a multi-valued field, bm25 is computed over all elements, not per element.

You can achieve similar functionality using rank features designed for multi-valued fields. See https://docs.vespa.ai/en/searching-multi-valued-fields.html and https://docs.vespa.ai/en/reference/rank-features.html#features-for-indexed-multivalue-string-fields.
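
A minimal sketch of such a rank profile, assuming the elementSimilarity feature listed under the second link fits the use case. The profile name max_title_match is made up for illustration. By default, elementSimilarity(titles) scores each element of the field separately and returns the maximum across elements. It is a different similarity measure than bm25, so the scores are not comparable to bm25 scores, and whether it orders doc 1 and doc 2 as desired should be checked against real data.

```
# Hypothetical rank profile, placed inside the schema next to the titles field.
rank-profile max_title_match inherits default {
    first-phase {
        # Per-element similarity, aggregated with max by default.
        # elementCompleteness(titles).completeness is another per-element
        # feature from the same list that scores the best-matching element.
        expression: elementSimilarity(titles)
    }
}
```

The profile would then be selected at query time with the ranking query parameter, for example ranking=max_title_match.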

Unrelated: unless you want to group on this field, you don't want to use attribute, as it puts the whole field in memory. Without it, the field definition becomes:

field titles type array<string> { 
        indexing: index | summary 
        index: enable-bm25
}