How does Elasticsearch compute tf-idf? The scores look weird


I have an index whose documents store system information plus some searchable fields. The searchable fields are copied into the searchable_keys field. In this case there is only one such field: name.

Here's the definition of the index:

{
  "settings":{
    "analysis":{
      "analyzer":{
        "my_analyzer":{
          "filter":[
            "lowercase"
          ],
          "type":"custom",
          "tokenizer":"my_tokenizer"
        }
      },
      "tokenizer":{
        "my_tokenizer":{
          "token_chars":[
            "letter",
            "digit"
          ],
          "type":"edge_ngram",
          "min_gram":3,
          "max_gram":20
        }
      }
    }
  },
  "mappings":{
    "properties":{
      "entry_id":{
        "type":"keyword"
      },
      "workspace_id":{
        "type":"keyword"
      },
      "name":{
        "type":"text",
        "copy_to":"searchable_keys"
      },
      "searchable_keys":{
        "type":"text",
        "analyzer":"my_analyzer"
      }
    }
  }
}
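To see which terms this analyzer puts into searchable_keys, here is a rough Python approximation of the edge_ngram tokenizer followed by the lowercase filter. It is a sketch, not Elasticsearch's implementation: the function name edge_ngrams and the regular expression standing in for token_chars are assumptions made for illustration.

```python
import re

def edge_ngrams(text, min_gram=3, max_gram=20):
    """Approximate my_analyzer: split on non letter/digit characters,
    lowercase, then emit prefixes of length min_gram..max_gram per word."""
    tokens = []
    # [^\W_]+ matches runs of letters and digits (assumed stand-in for token_chars)
    for word in re.findall(r"[^\W_]+", text):
        word = word.lower()
        for size in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:size])
    return tokens

tokens_a = edge_ngrams("• Private Emerald Lake & Dogsledding Tour •")
tokens_b = edge_ngrams("Skagway Sled Dog and Musher's Camp")
print(len(tokens_a), "dog" in tokens_a)
print(len(tokens_b), "dog" in tokens_b)
```

Under these assumptions both names produce the term dog (from "Dogsledding" and from "Dog"), each exactly once, and the token counts come out as 23 and 15, which agree with the dl values reported in the explain output in this post.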

I ran the following query:

{
  "explain":true,
  "query":{
    "match":{
      "searchable_keys":{
        "query":"dog",
        "operator":"AND"
      }
    }
  }
}

and got a strange result (the full documents from the response are listed below): the document named "• Private Emerald Lake & Dogsledding Tour •" has a score of 3.7377324, while the document named "Skagway Sled Dog and Musher's Camp" has a score of 3.718998.

Full documents from the response:

[
  {
    "_index":"tours",
    "_id":"018bb59a-bc8c-76a2-9e76-eaf747bac7c1",
    "_score":3.7377324,
    "_source":{
      "entry_id":"018bb59a-bc8c-76a2-9e76-eaf747bac7c1",
      "workspace_id":"018bb598-708a-7e8d-8995-b30cf0aba239",
      "name":"• Private Emerald Lake & Dogsledding Tour •",
      "type":"Tour"
    },
    "_explanation":{
      "value":3.7377324,
      "description":"weight(searchable_keys:dog in 68) [PerFieldSimilarity], result of:",
      "details":[
        {
          "value":3.7377324,
          "description":"score(freq=1.0), computed as boost * idf * tf from:",
          "details":[
            {
              "value":2.2,
              "description":"boost",
              "details":[
                
              ]
            },
            {
              "value":4.017076,
              "description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details":[
                {
                  "value":6,
                  "description":"n, number of documents containing term",
                  "details":[
                    
                  ]
                },
                {
                  "value":360,
                  "description":"N, total number of documents with field",
                  "details":[
                    
                  ]
                }
              ]
            },
            {
              "value":0.4229368,
              "description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details":[
                {
                  "value":1.0,
                  "description":"freq, occurrences of term within document",
                  "details":[
                    
                  ]
                },
                {
                  "value":1.2,
                  "description":"k1, term saturation parameter",
                  "details":[
                    
                  ]
                },
                {
                  "value":0.75,
                  "description":"b, length normalization parameter",
                  "details":[
                    
                  ]
                },
                {
                  "value":23.0,
                  "description":"dl, length of field",
                  "details":[
                    
                  ]
                },
                {
                  "value":19.447222,
                  "description":"avgdl, average length of field",
                  "details":[
                    
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  },
  {
    "_index":"tours",
    "_id":"018bb598-e50e-7d6d-a639-97ed40bb2ee7",
    "_score":3.718998,
    "_source":{
      "entry_id":"018bb598-e50e-7d6d-a639-97ed40bb2ee7",
      "workspace_id":"018bb598-708a-7e8d-8995-b30cf0aba239",
      "name":"Skagway Sled Dog and Musher's Camp",
      "type":"Tour"
    },
    "_explanation":{
      "value":3.718998,
      "description":"weight(searchable_keys:dog in 105) [PerFieldSimilarity], result of:",
      "details":[
        {
          "value":3.718998,
          "description":"score(freq=1.0), computed as boost * idf * tf from:",
          "details":[
            {
              "value":2.2,
              "description":"boost",
              "details":[
                
              ]
            },
            {
              "value":3.3953834,
              "description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details":[
                {
                  "value":11,
                  "description":"n, number of documents containing term",
                  "details":[
                    
                  ]
                },
                {
                  "value":342,
                  "description":"N, total number of documents with field",
                  "details":[
                    
                  ]
                }
              ]
            },
            {
              "value":0.49786824,
              "description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details":[
                {
                  "value":1.0,
                  "description":"freq, occurrences of term within document",
                  "details":[
                    
                  ]
                },
                {
                  "value":1.2,
                  "description":"k1, term saturation parameter",
                  "details":[
                    
                  ]
                },
                {
                  "value":0.75,
                  "description":"b, length normalization parameter",
                  "details":[
                    
                  ]
                },
                {
                  "value":15.0,
                  "description":"dl, length of field",
                  "details":[
                    
                  ]
                },
                {
                  "value":19.052631,
                  "description":"avgdl, average length of field",
                  "details":[
                    
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  }
]
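The two scores can be recomputed by hand from the numbers in the explain output above. The following is a minimal sketch that plugs the reported values into the two formula strings printed in the output; the function name bm25_score and the "shared statistics" comparison at the end are illustrative additions, not something returned by Elasticsearch.

```python
import math

def bm25_score(freq, n, N, dl, avgdl, boost=2.2, k1=1.2, b=0.75):
    """Recompute boost * idf * tf using the formulas shown in the explain output."""
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf, idf, tf

# Values copied from the two _explanation blocks
first = bm25_score(freq=1.0, n=6, N=360, dl=23.0, avgdl=19.447222)
second = bm25_score(freq=1.0, n=11, N=342, dl=15.0, avgdl=19.052631)
print("first: ", first)
print("second:", second)

# Hypothetical: the second document scored with the FIRST document's statistics
shared = bm25_score(freq=1.0, n=6, N=360, dl=15.0, avgdl=19.447222)
print("second with shared stats:", shared)
```

The recomputed values should agree with the reported scores up to floating-point rounding. The last line suggests that the ordering is driven by the differing N, n and avgdl: with identical statistics, the shorter field (dl = 15) would score higher than the longer one (dl = 23).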

Questions:

  1. Why is the idf different for the two documents? The idf of a given term should be the same for every document in the collection. Am I wrong?

  2. What is this odd formula for tf? Isn't tf simply the number of occurrences of the word divided by the number of words in the document?

  3. How can I make a document that contains "dog" as a separate word score higher than a document where "dog" only occurs as the start of a longer word, without losing the prefix matching that the edge n-gram tokenizer provides?

There is 1 answer

Answered by vriv
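  1. Based on the basic tf-idf formula (https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting), yes, the idf of a term should be constant across a collection. The formula in your explain output is slightly different (it is the BM25 form of idf), but it is still constant for fixed N and n. In your output, however, the two hits report different statistics (N = 360, n = 6 versus N = 342, n = 11), so the two scores were computed over two different document sets. I am not an Elasticsearch expert, but this is what you would see if the two documents live on different shards, since by default each shard scores with its own local term statistics.
  2. In the link I provided, term frequency is defined as "the number of times a term occurs in a given document", i.e. a raw count and not a ratio. The tf in your output is neither a raw count nor a simple ratio: it is the term-frequency component of BM25, which is Elasticsearch's default similarity rather than a custom implementation. It saturates as freq grows and is normalized by the field length dl relative to the average field length avgdl.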
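To illustrate point 2, the sketch below compares the tf formula printed in the explain output (freq / (freq + k1 * (1 - b + b * dl / avgdl))) with the simple "occurrences divided by document length" ratio from the question. The helper names bm25_tf and ratio_tf and the sample lengths are made up for illustration; k1 = 1.2 and b = 0.75 are the values shown in the explain output.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf component as printed in the explain output."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def ratio_tf(freq, dl):
    """Naive tf: occurrences divided by field length."""
    return freq / dl

# Same field length as the average: the BM25 tf grows with freq but stays below 1
for freq in (1, 2, 5, 10):
    print(freq, round(bm25_tf(freq, dl=20, avgdl=20), 3), ratio_tf(freq, dl=20))

# Same freq, different lengths: a shorter field gets a higher tf
print(round(bm25_tf(1, dl=15, avgdl=20), 3), round(bm25_tf(1, dl=23, avgdl=20), 3))
```

The first loop shows the saturation effect: each extra occurrence adds less than the previous one, whereas the naive ratio grows linearly. The last line shows the length normalization that makes a 15-token field score higher than a 23-token field for the same single match.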