Full Text Search with Medium Weight Sizes gives Unexpected Ranking


Given a SQL Server table with one column [SearchableDescription] included in the full text search catalog/index with example data:

  1. apple banana cherry
  2. apple banana cherry grape
  3. apple banana cherry grape yam
  4. apple banana cherry grape yam zucchini
  5. ...
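For reference, a minimal sketch of the setup implied above (the names TheTable, RowId, PK_TheTable, and FtCatalog are assumptions for illustration, not from the original post):

```sql
-- Minimal repro sketch; requires SQL Server with full-text search installed.
create table TheTable (
    RowId int not null constraint PK_TheTable primary key,
    SearchableDescription nvarchar(400) not null
);

create fulltext catalog FtCatalog;

-- The full-text index must reference a named unique index on the table.
create fulltext index on TheTable (SearchableDescription)
    key index PK_TheTable on FtCatalog
    with change_tracking auto;

insert TheTable (RowId, SearchableDescription) values
    (1, N'apple banana cherry'),
    (2, N'apple banana cherry grape'),
    (3, N'apple banana cherry grape yam'),
    (4, N'apple banana cherry grape yam zucchini');
```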

Then we search using CONTAINSTABLE() like:

declare
    @aboutPredicate nvarchar(4000) =
        N'ISABOUT(
            cherry weight (0.5),
            grape weight (0.5)
        )';

select *
  from TheTable t
  join containstable(TheTable, SearchableDescription, @aboutPredicate) ct
    on ct.[KEY] = t.RowId;  -- KEY is a reserved word, so it must be bracketed
The problem is with the [RANK] column output. When the input weights are >= 0.3, as shown above, rows containing both cherry and grape are often given a lower rank than rows containing only one of the two.

However, if the weights are adjusted to be much lower (~0.1), like:

declare
    @aboutPredicate nvarchar(4000) =
        N'ISABOUT(
            cherry weight (0.1),
            grape weight (0.1)
        )';

Then the rows containing both search terms are ranked highest.

I recall there being a word-uniqueness component built into the rank calculation, but I am surprised it could affect the result so strongly as to outweigh the fact that a word was matched at all. In our real data, most of the words and search terms are quite unique (part numbers, technical family names, etc.), so even when both search terms are rare, this still happens.

I would like to understand the reason behind this behavior. And if using small input weights is acceptable, are there other disadvantages to doing so?

Update: Weights that are too small (<= 0.05) also caused the same issue. The most important factor appears to be the balance between the weights. In our case (which may not be typical), each weight must be within ~10% of the other for rows containing both words to rank above rows containing just one. The ~10% tolerance applies in either direction, to either word (the words likely have similar document/corpus frequencies). Even with the weights equal, values that were too high or too low still caused the issue.

1 Answer

Answered by Adrian Maxwell (accepted answer)

Term Frequency-Inverse Document Frequency (TF-IDF) ranking model:

The ranking algorithm used in full text search is based on a combination of factors, one of which is the frequency of the search terms in the document compared to their frequency across the entire set of documents being searched. The impact of this can be counter-intuitive, e.g.:

  • The rank score depends on how often your terms appear in each document and in all documents combined, i.e. counting how many times "cherry" and "grape" appear in one book, and then in all books.

  • If you give "cherry" and "grape" a low weight (like 0.1), these words are treated as less important. The algorithm then leans more on other factors, such as how many distinct words a document contains or how long it is. Documents containing both "cherry" and "grape" tend to get higher scores in that scenario.

  • But if you give "cherry" and "grape" a high weight (like 0.5), these words become far more important, so the algorithm focuses more heavily on them. Documents with just "cherry" or just "grape" can then get higher scores, because the single matched term is so heavily weighted (and also less diluted by other terms).
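To see why a term's rarity can swamp the weight you assign it, here is a toy inverse-document-frequency calculation over the four sample rows, using the generic idf = log(N / df) form (SQL Server's internal ranking formula differs in detail; this is illustrative only):

```sql
-- Generic idf = log(N / df) over the 4 sample rows.
-- "cherry" appears in all 4 rows; "grape" appears in 3 of 4.
select log(4.0 / 4.0) as idf_cherry,  -- 0: a term present in every row contributes nothing
       log(4.0 / 3.0) as idf_grape;   -- ~0.29: the rarer term drives the ordering
```

When one search term is much more common than the other across the corpus, its effective contribution shrinks regardless of the ISABOUT weight assigned to it, which may help explain the observation in the question that the balance between the two weights matters more than their absolute values.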

For more background, see: Understanding TF-IDF (Term Frequency-Inverse Document Frequency).