I have an index whose documents store system information along with searchable fields, which are copied into the searchable_keys field. In this case there is only one such field: name.
Here's the definition of the index:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "token_chars": [
            "letter",
            "digit"
          ],
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 20
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "entry_id": {
        "type": "keyword"
      },
      "workspace_id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "copy_to": "searchable_keys"
      },
      "searchable_keys": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
I ran the following query:
{
  "explain": true,
  "query": {
    "match": {
      "searchable_keys": {
        "query": "dog",
        "operator": "AND"
      }
    }
  }
}
and got a surprising result: the document named "• Private Emerald Lake & Dogsledding Tour •" scores 3.7377324, while the document named "Skagway Sled Dog and Musher's Camp" scores only 3.718998, even though only the second one contains "dog" as a separate word.
Full documents from the response:
[
  {
    "_index": "tours",
    "_id": "018bb59a-bc8c-76a2-9e76-eaf747bac7c1",
    "_score": 3.7377324,
    "_source": {
      "entry_id": "018bb59a-bc8c-76a2-9e76-eaf747bac7c1",
      "workspace_id": "018bb598-708a-7e8d-8995-b30cf0aba239",
      "name": "• Private Emerald Lake & Dogsledding Tour •",
      "type": "Tour"
    },
    "_explanation": {
      "value": 3.7377324,
      "description": "weight(searchable_keys:dog in 68) [PerFieldSimilarity], result of:",
      "details": [
        {
          "value": 3.7377324,
          "description": "score(freq=1.0), computed as boost * idf * tf from:",
          "details": [
            {
              "value": 2.2,
              "description": "boost",
              "details": []
            },
            {
              "value": 4.017076,
              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details": [
                {
                  "value": 6,
                  "description": "n, number of documents containing term",
                  "details": []
                },
                {
                  "value": 360,
                  "description": "N, total number of documents with field",
                  "details": []
                }
              ]
            },
            {
              "value": 0.4229368,
              "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "freq, occurrences of term within document",
                  "details": []
                },
                {
                  "value": 1.2,
                  "description": "k1, term saturation parameter",
                  "details": []
                },
                {
                  "value": 0.75,
                  "description": "b, length normalization parameter",
                  "details": []
                },
                {
                  "value": 23.0,
                  "description": "dl, length of field",
                  "details": []
                },
                {
                  "value": 19.447222,
                  "description": "avgdl, average length of field",
                  "details": []
                }
              ]
            }
          ]
        }
      ]
    }
  },
  {
    "_index": "tours",
    "_id": "018bb598-e50e-7d6d-a639-97ed40bb2ee7",
    "_score": 3.718998,
    "_source": {
      "entry_id": "018bb598-e50e-7d6d-a639-97ed40bb2ee7",
      "workspace_id": "018bb598-708a-7e8d-8995-b30cf0aba239",
      "name": "Skagway Sled Dog and Musher's Camp",
      "type": "Tour"
    },
    "_explanation": {
      "value": 3.718998,
      "description": "weight(searchable_keys:dog in 105) [PerFieldSimilarity], result of:",
      "details": [
        {
          "value": 3.718998,
          "description": "score(freq=1.0), computed as boost * idf * tf from:",
          "details": [
            {
              "value": 2.2,
              "description": "boost",
              "details": []
            },
            {
              "value": 3.3953834,
              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details": [
                {
                  "value": 11,
                  "description": "n, number of documents containing term",
                  "details": []
                },
                {
                  "value": 342,
                  "description": "N, total number of documents with field",
                  "details": []
                }
              ]
            },
            {
              "value": 0.49786824,
              "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "freq, occurrences of term within document",
                  "details": []
                },
                {
                  "value": 1.2,
                  "description": "k1, term saturation parameter",
                  "details": []
                },
                {
                  "value": 0.75,
                  "description": "b, length normalization parameter",
                  "details": []
                },
                {
                  "value": 15.0,
                  "description": "dl, length of field",
                  "details": []
                },
                {
                  "value": 19.052631,
                  "description": "avgdl, average length of field",
                  "details": []
                }
              ]
            }
          ]
        }
      ]
    }
  }
]
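As a sanity check, both scores can be reproduced from the numbers in the explain output. The sketch below is plain Python, not an Elasticsearch API: the function names bm25_idf, bm25_tf and bm25_score are my own, and the code only plugs the reported values into the formulas quoted in the description strings.

```python
import math

def bm25_idf(n, N):
    # idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)); log is the natural log
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, k1, b, dl, avgdl):
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def bm25_score(boost, n, N, freq, k1, b, dl, avgdl):
    # score, computed as boost * idf * tf
    return boost * bm25_idf(n, N) * bm25_tf(freq, k1, b, dl, avgdl)

# Values copied from the _explanation of each hit.
dogsledding = bm25_score(boost=2.2, n=6, N=360, freq=1.0,
                         k1=1.2, b=0.75, dl=23.0, avgdl=19.447222)
sled_dog = bm25_score(boost=2.2, n=11, N=342, freq=1.0,
                      k1=1.2, b=0.75, dl=15.0, avgdl=19.052631)

print("Dogsledding:", dogsledding)
print("Sled Dog:   ", sled_dog)
```

Both values agree with the reported _score values up to floating-point rounding. The "Sled Dog" hit has the higher tf (0.49786824 vs 0.4229368, because its field is shorter), but the "Dogsledding" hit has the higher idf (4.017076 vs 3.3953834), since n and N differ between the two explanations, and that difference is enough to put it on top.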
Questions:
Why is the idf different for the two documents? I thought the idf of a given term is the same for every document in the collection. Am I wrong?
What is this unusual formula for tf? Isn't tf simply the number of occurrences of the term divided by the number of words in the document?
How can I make a document that contains "dog" as a separate word score higher than a document where "dog" only occurs as part of a longer word, without losing the partial-word matching that the edge n-gram tokenizer provides?