I have a schema that has a field of type array<string>:
field titles type array<string> {
    indexing: index | summary | attribute
    index: enable-bm25
    attribute: fast-search
}
Say titles contains N titles: Title 1, Title 2, ..., Title N. I would like to rank documents based on the maximum bm25 score between any one of the titles in titles and the query. In other words, I would like the rank score of the document to equal max(bm25('Title 1'), bm25('Title 2'), ..., bm25('Title N')).
Just setting the ranking expression to bm25(titles) does not achieve what I want. For example, given a query Q with terms term 1, term 2, term 3 and two documents:
- doc 1:
{"titles": [".*term 1.*", ".*term 1.*", ".*term 1.*", ".*term 1.*"]}
- doc 2:
{"titles": [".*term 1 term 2 term 3.*", STRING_WITH_NONE_OF_THE_TERMS, STRING_WITH_NONE_OF_THE_TERMS, STRING_WITH_NONE_OF_THE_TERMS, STRING_WITH_NONE_OF_THE_TERMS]}
With the bm25(titles) ranking expression, doc 1 ranks higher than doc 2. I assume this is because a query term appears in every title of doc 1, while in doc 2 the query terms appear in only one title. I want doc 2 to rank higher because it contains a title that is an almost complete match for the query, so max(bm25) should be higher for doc 2, even though the average or sum over all titles might be higher for doc 1.
Is there a way I can achieve that in Vespa?
Thanks for the detailed question. Vespa does not support this for the bm25 rank feature; it is computed over all elements. You can achieve similar functionality using rank features designed for multi-valued fields. See https://docs.vespa.ai/en/searching-multi-valued-fields.html and https://docs.vespa.ai/en/reference/rank-features.html#features-for-indexed-multivalue-string-fields.
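As a rough sketch of what that could look like, the rank profile below scores each document by its best-matching element in titles. The profile name best_title is made up for illustration. Note that elementCompleteness is a match-completeness score and not BM25, so this approximates the "best single title wins" behaviour rather than reproducing max(bm25) exactly; elementSimilarity(titles) is another per-element feature from the same documentation section that may be worth trying.

```
# Goes inside the schema that defines the titles field.
# "best_title" is a hypothetical profile name.
rank-profile best_title {
    first-phase {
        # Completeness of the single best-matching element in titles,
        # combining how much of the query and how much of that element matched.
        expression: elementCompleteness(titles).completeness
    }
}
```

With the example documents from the question, doc 2 has one title that covers all three query terms, so a per-element feature like this should favour it over doc 1, where every title covers only one term.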
Unrelated: note that unless you want to group on this field, you don't want to use attribute, as it puts everything in memory.
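Assuming you do not need grouping (or anything else that requires an attribute) on this field, a minimal version of the field definition without the attribute could look like this:

```
field titles type array<string> {
    # No "attribute" in the indexing statement, so the field is not kept as an in-memory attribute.
    indexing: index | summary
    index: enable-bm25
}
```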