Understanding Apache Lucene's scoring algorithm


I've been working with Hibernate Search for months now, but I still can't make sense of the relevance it produces. I'm satisfied with the results overall, but even the simplest test does not meet my expectations.

My first test was about term frequency (tf). The data:

  • word
  • word word
  • word word word
  • word word word word
  • word word word word word
  • word word word word word word

Results I get:

  1. word
  2. word word word word
  3. word word word word word
  4. word word word word word word
  5. word word
  6. word word word

I'm really confused by this ordering. My query is quite complex, but since this test involves no other field, it can be simplified as below:

booleanjunction.should(phraseQuery).should(keywordQuery).should(fuzzyQuery)
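For context, here is a simplified sketch of how such a query is assembled with the Hibernate Search query DSL; the entity and field names below are placeholders, not my real mapping:

    import org.apache.lucene.search.Query;
    import org.hibernate.search.jpa.FullTextEntityManager;
    import org.hibernate.search.jpa.Search;
    import org.hibernate.search.query.dsl.QueryBuilder;

    // "Book" and "title" are placeholder names for this sketch.
    FullTextEntityManager ftem = Search.getFullTextEntityManager(entityManager);
    QueryBuilder qb = ftem.getSearchFactory()
            .buildQueryBuilder().forEntity(Book.class).get();

    Query phraseQuery  = qb.phrase().onField("title").sentence("word word").createQuery();
    Query keywordQuery = qb.keyword().onField("title").matching("word").createQuery();
    Query fuzzyQuery   = qb.keyword().fuzzy().withEditDistanceUpTo(1)
            .onField("title").matching("word").createQuery();

    // The boolean junction combining the three sub-queries as SHOULD clauses.
    Query luceneQuery = qb.bool()
            .should(phraseQuery)
            .should(keywordQuery)
            .should(fuzzyQuery)
            .createQuery();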

My analyzers are as follows (a sketch of the @AnalyzerDef is shown after the list):

 StandardFilterFactory
 LowerCaseFilterFactory
 StopFilterFactory
 SnowballPorterFilterFactory for english
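Roughly, the analyzer definition looks like this; the entity name, the analyzer name and the tokenizer here are just placeholders for the sketch, the filter chain is the one listed above:

    import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
    import org.apache.lucene.analysis.core.StopFilterFactory;
    import org.apache.lucene.analysis.snowball.SnowballPorterFilterFactory;
    import org.apache.lucene.analysis.standard.StandardFilterFactory;
    import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
    import org.hibernate.search.annotations.AnalyzerDef;
    import org.hibernate.search.annotations.Parameter;
    import org.hibernate.search.annotations.TokenFilterDef;
    import org.hibernate.search.annotations.TokenizerDef;

    @AnalyzerDef(name = "customAnalyzer",   // analyzer name is a placeholder
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
            @TokenFilterDef(factory = StandardFilterFactory.class),
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = StopFilterFactory.class),
            @TokenFilterDef(factory = SnowballPorterFilterFactory.class,
                params = @Parameter(name = "language", value = "English"))
        })
    public class Book { /* ... indexed fields ... */ }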

My Explanation object: https://jsfiddle.net/o51kh3og/

1 Answer

Answer by alexf (accepted):

Score calculation is genuinely complex. Here, you have to begin with the fundamental equation:

score(q,d) = coord(q,d) · queryNorm(q) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

As you said, you have tf, which means term frequency; its value is the square root of the term's frequency in the document.

But here, as you can see in your Explanation, you also have norm (a.k.a. fieldNorm), which is used in the fieldWeight calculation. Let's take your example:

eklavya eklavya eklavya eklavya eklavya

4.296241 = fieldWeight in 177, product of:
  2.236068 = tf(freq=5.0), with freq of:
    5.0 = termFreq=5.0
  4.391628 = idf(docFreq=6, maxDocs=208)
  0.4375 = fieldNorm(doc=177)

eklavya

4.391628 = fieldWeight in 170, product of:
  1.0 = tf(freq=1.0), with freq of:
    1.0 = termFreq=1.0
  4.391628 = idf(docFreq=6, maxDocs=208)
  1.0 = fieldNorm(doc=170)

Here, eklavya gets a better score than the other document because fieldWeight is the product of tf, idf and fieldNorm, and this last factor is higher for the eklavya document because it contains only one term.
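As a sanity check, the tf and idf factors above can be reproduced with the classic TF-IDF formulas, tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)); the fieldNorm values are taken straight from your Explanation:

    public class ScoreCheck {
        public static void main(String[] args) {
            // tf(freq) = sqrt(freq)
            double tf5 = Math.sqrt(5.0);                  // 2.236068 for "eklavya" x5
            double tf1 = Math.sqrt(1.0);                  // 1.0 for the single-term doc

            // idf(docFreq, numDocs) = 1 + ln(numDocs / (docFreq + 1))
            double idf = 1.0 + Math.log(208.0 / (6 + 1)); // 4.391628, same for both docs

            // fieldWeight = tf * idf * fieldNorm (fieldNorm read from the Explanation)
            System.out.println(tf5 * idf * 0.4375);       // ~4.2962 -> 5-term document
            System.out.println(tf1 * idf * 1.0);          // ~4.3916 -> 1-term document
        }
    }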

As the Lucene Similarity documentation says:

lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score.

The more terms you have in a field, the lower the fieldNorm will be. So be careful with the length of this field.
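With the classic similarity, this norm is lengthNorm = 1 / sqrt(numTerms) (assuming no index-time boost), and it is compressed into a single byte when the document is indexed, which is why your Explanation shows 0.4375 instead of the exact 0.4472:

    public class FieldNormCheck {
        public static void main(String[] args) {
            // lengthNorm(numTerms) = 1 / sqrt(numTerms), assuming no boosts
            System.out.println(1.0 / Math.sqrt(1));  // 1.0    -> "eklavya"
            System.out.println(1.0 / Math.sqrt(5));  // 0.4472 -> "eklavya" x5
            // The norm is compressed into one byte in the index,
            // so 0.4472 ends up stored as 0.4375.
        }
    }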

So, to conclude, this is a perfect example to understand that the score is not calculated from the term frequency alone, but also from the number of terms in the field.
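If you want to inspect these factors for each hit, you can project the Explanation from the full-text query. A sketch, where luceneQuery, the Book entity and the entity manager are placeholders matching the simplified query from your question:

    import java.util.List;

    import org.apache.lucene.search.Explanation;
    import org.hibernate.search.jpa.FullTextEntityManager;
    import org.hibernate.search.jpa.FullTextQuery;
    import org.hibernate.search.jpa.Search;

    // Project the score Explanation next to each matching entity.
    FullTextEntityManager ftem = Search.getFullTextEntityManager(entityManager);
    FullTextQuery ftQuery = ftem.createFullTextQuery(luceneQuery, Book.class);
    ftQuery.setProjection(FullTextQuery.EXPLANATION, FullTextQuery.THIS);

    @SuppressWarnings("unchecked")
    List<Object[]> rows = ftQuery.getResultList();
    for (Object[] row : rows) {
        Explanation explanation = (Explanation) row[0];
        System.out.println(explanation);   // prints the full score breakdown
    }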