Improve score if the field starts with the term


I'm trying to build an efficient auto-complete search input for my website, to search for cities. I assume that people will always start typing their city name from the beginning, with the words in the right order. E.g. a user who lives in Saint-Maur will type sai.. but will never start with mau..

I need to improve the score of a result when it starts with the term from the query. E.g. if a user types pari, the city Parigné-le-Pôlin should have a better score than Fontenay-en-Parisis, since it starts with pari.

I'm using an edge n-gram filter, and a phrase match because the order of words matters. I'm sure that my problem has a simple solution, but I'm a newbie in the ES magic world :)

Here is my mapping:

{
    "settings": {
        "index": {
            "number_of_shards": 1
        },

        "analysis": {
            "analyzer": {
                "partialPostalCodeAnalyzer": {
                    "tokenizer": "standard",
                    "filter": ["partialFilter"]
                },
                "partialNameAnalyzer": {
                    "tokenizer": "standard",
                    "filter": ["asciifolding", "lowercase", "word_delimiter", "partialFilter"]
                },
                "searchAnalyzer": {
                    "tokenizer": "standard",
                    "filter": ["asciifolding", "lowercase", "word_delimiter"]
                }
            },

            "filter": {
                "partialFilter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 50
                }
            }
        }
    },

    "mappings": {
        "village": {
            "properties": {
                "postalCode": {
                    "type": "string",
                    "index_analyzer": "partialPostalCodeAnalyzer",
                    "search_analyzer": "searchAnalyzer"
                },

                "name": {
                    "type": "string",
                    "index_analyzer": "partialNameAnalyzer",
                    "search_analyzer": "searchAnalyzer"
                },

                "population": {
                    "type": "integer",
                    "index": "not_analyzed"
                }
            }
        }
    }
}

Some sample data:

PUT /tv_village/village/1 {"name": "Paris"}
PUT /tv_village/village/2 {"name": "Parigny"}
PUT /tv_village/village/3 {"name": "Fontenay-en-Parisis"}
PUT /tv_village/village/4 {"name": "Parigné-le-Pôlin"}

If I perform this query, you can see that the results are not in the order I want (I want the 4th result to come before the 3rd one):

GET /tv_village/village/_search
{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  }
}

Results:

      "hits": [
         {
            "_index": "tv_village",
            "_type": "village",
            "_id": "1",
            "_score": 0.7768564,
            "_source": {
               "name": "Paris"
            }
         },
         {
            "_index": "tv_village",
            "_type": "village",
            "_id": "2",
            "_score": 0.7768564,
            "_source": {
               "name": "Parigny"
            }
         },
         {
            "_index": "tv_village",
            "_type": "village",
            "_id": "3",
            "_score": 0.3884282,
            "_source": {
               "name": "Fontenay-en-Parisis"
            }
         },
         {
            "_index": "tv_village",
            "_type": "village",
            "_id": "4",
            "_score": 0.3884282,
            "_source": {
               "name": "Parigné-le-Pôlin"
            }
         }
      ]

There is 1 answer

Andrei Stefan (best answer)

In your index settings, add another analyzer next to the existing ones:

            "keywordLowercaseAnalyer": {
              "tokenizer": "keyword",
              "filter": ["lowercase"]
            }

meaning, keep the whole value intact as a single token (through the keyword tokenizer) and lowercase it (like "parigné-le-pôlin"). Then define two more sub-fields for your name field:

  • one raw that should be not_analyzed
  • one raw_lowercase that should use keywordLowercaseAnalyer

    "name": {
      "type": "string",
      "index_analyzer": "partialNameAnalyzer",
      "search_analyzer": "searchAnalyzer",
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        },
        "raw_lowercase": {
          "type": "string",
          "analyzer": "keywordLowercaseAnalyer"
        }
      }
    }
    

I'm doing this because you can have searches for "pari" or "Pari". In your query, use the rescore functionality to recompute the scoring based on an additional query:

{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  },
  "rescore": {
    "query": {
      "rescore_query": {
        "bool": {
          "should": [
            {"prefix": {"name.raw": "pari"}},
            {"prefix": {"name.raw_lowercase": "pari"}}
          ]
        }
      }
    }
  }
}

The prefix query has two drawbacks for your use case:

  • it is quite resource intensive
  • the value passed to a prefix query is not analyzed, and this is the reason for adding those two raw* fields: one field holds a lowercase version, the other holds the untouched version, so that queries for "pari" or "Pari" are both covered.
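Because the prefix value is not analyzed, one possible simplification is to normalize the input yourself. The sketch below assumes your application lowercases the user's input before building the request (an assumption, nothing in the mapping does this for you), so a single prefix clause on name.raw_lowercase is enough. Accents are still not folded on that field: a user typing "parigne" would not match "parigné-le-pôlin" unless you also add asciifolding to the analyzer and fold the input the same way.

```json
{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  },
  "rescore": {
    "query": {
      "rescore_query": {
        "prefix": {"name.raw_lowercase": "pari"}
      }
    }
  }
}
```

This halves the number of prefix clauses evaluated during rescoring, at the cost of moving the case handling into your application code.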

I have two suggestions:

  • test the query above on your real data to see how it behaves, performance-wise
  • play with the window_size attribute of the rescore to limit the number of top documents (per shard) that the rescoring is performed on, thus improving the performance.
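As a rough sketch of the second suggestion, this is the same query with window_size set, plus the optional query_weight and rescore_query_weight settings that control how the original score and the rescore score are combined. The values 50, 1.0 and 2.0 are illustrative assumptions, not recommendations; tune them against your own data.

```json
{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "bool": {
          "should": [
            {"prefix": {"name.raw": "pari"}},
            {"prefix": {"name.raw_lowercase": "pari"}}
          ]
        }
      },
      "query_weight": 1.0,
      "rescore_query_weight": 2.0
    }
  }
}
```

Only the top window_size hits of each shard are rescored, so a city that starts with the term but ranks below that cut-off in the original query will not be promoted. For an auto-complete box that shows a handful of suggestions, a window a few times larger than the number of displayed results is usually a reasonable starting point.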

For your reference, this is the documentation page for rescore.