relevance models


The relevance model just estimates the feedback term distribution from the feedback documents. In that case, the relevance model would have a higher probability of picking up common words as its feedback terms. Thus I assumed the performance of the relevance model wouldn't be as good as that of the other two models. However, I learned that all of those models perform pretty well. What would be the reason for that?


1 Answer

Debasis:

"In contrast, the relevance model just estimates the relevance feedback based on feedback documents. In this case, the relevance model would have a higher probability of getting common words as its feedbacks"

That's a common perception, but it isn't necessarily true. To be more specific, recall that the estimation equation of the relevance model looks like:

P(w|R) = \sum_{D \in Top-K} P(w|D) \prod_{q \in Q} P(q|D)

which in plain English means the following:

To compute the weight of a term w over the set of top-K documents, you iterate over each document D in the top-K and multiply P(w|D) by the similarity score of Q with D (this is the factor \prod_{q \in Q} P(q|D)). Now, the idf factor is hidden inside the expression P(w|D).
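
As a rough illustration, here is a minimal Python sketch of that outer loop. The names p_w_given_d, top_k_docs and vocabulary are hypothetical placeholders (not from any particular toolkit), and p_w_given_d(w, d) is assumed to return the smoothed probability of term w under document d's language model:

    from collections import defaultdict

    def relevance_model(query_terms, top_k_docs, p_w_given_d, vocabulary):
        """Estimate P(w|R) from the top-K pseudo-relevant documents."""
        weights = defaultdict(float)
        for d in top_k_docs:
            # Query likelihood of d: the product of P(q|D) over the query terms.
            query_likelihood = 1.0
            for q in query_terms:
                query_likelihood *= p_w_given_d(q, d)
            # Each candidate term's P(w|D) is weighted by that likelihood.
            for w in vocabulary:
                weights[w] += p_w_given_d(w, d) * query_likelihood
        # Normalise so that the estimates sum to 1 over the vocabulary.
        total = sum(weights.values())
        return {w: v / total for w, v in weights.items()} if total > 0 else dict(weights)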

Following the standard language modelling paradigm (Jelinek-Mercer or Dirichlet smoothing), this isn't just a simple maximum-likelihood estimate but rather a collection-smoothed version; e.g., for Jelinek-Mercer, this is:

P(w|D) = log(1 + lambda/(1-lambda) * count(w,D)/length(D) * collection_size/cf(w))

which is nothing but a linear-combination-based generalization of tf*idf: the second component, collection_size/cf(w), specifically denotes the inverse collection frequency.
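
A small sketch of that smoothed weight in Python, with all statistics passed in explicitly (the argument names and the default lambda of 0.7 are illustrative assumptions, not values from the answer):

    import math

    def jm_smoothed_weight(count_w_d, doc_length, cf_w, collection_size, lam=0.7):
        """Jelinek-Mercer smoothed term weight in the log form quoted above.

        count_w_d       -- frequency of w in document D
        doc_length      -- number of tokens in D
        cf_w            -- collection frequency of w
        collection_size -- total number of tokens in the collection
        lam             -- smoothing parameter lambda, 0 < lam < 1
        """
        tf_part = count_w_d / doc_length     # max-likelihood P(w|D)
        icf_part = collection_size / cf_w    # inverse collection frequency
        return math.log(1.0 + (lam / (1.0 - lam)) * tf_part * icf_part)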

So, this expression of P(w|D) ensures that terms with higher idf values tend to get higher weights in the relevance model estimation. In addition to high idf, a term also needs a high level of co-occurrence with the query terms, because P(w|D) is multiplied by the query likelihood \prod_{q \in Q} P(q|D).
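
Putting the two sketches together (reusing the hypothetical names from above), you would rank the vocabulary by P(w|R) and keep the strongest terms as feedback/expansion terms:

    # Hypothetical usage: rank candidate terms by their relevance-model weight
    # and keep the top ones as expansion terms.
    p_w_r = relevance_model(query_terms, top_k_docs, p_w_given_d, vocabulary)
    expansion_terms = sorted(p_w_r, key=p_w_r.get, reverse=True)[:20]

High-idf terms that co-occur with the query terms in the top-K documents dominate this ranking, which is exactly the effect described above.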