Improve score if the field starts with the term

Question

Asked by Raphaël Malié on 18 November 2014 at 15:51 · 400 views

I'm trying to build an efficient auto-complete search input on my website, to search cities. I assume that people will always start typing their city name from the beginning, with the words in the right order. E.g. a user who lives in Saint-Maur will type sai.. but will never type mau.. first.

I need to improve the score of a result if it starts with the term from the query. E.g. if a user types pari, the city Parigné-le-Pôlin should have a better score than Fontenay-en-Parisis, since it starts with pari.

I'm using an edge n-gram filter, and a phrase match because the order of words matters. I'm sure that my problem has a simple solution, but I'm a newb in the ES magic world :)

Here is my mapping:

{
    "settings": {
        "index": {
            "number_of_shards": 1
        },

        "analysis": {
            "analyzer": {
                "partialPostalCodeAnalyzer": {
                    "tokenizer": "standard",
                    "filter": ["partialFilter"]
                },
                "partialNameAnalyzer": {
                    "tokenizer": "standard",
                    "filter": ["asciifolding", "lowercase", "word_delimiter", "partialFilter"]
                },
                "searchAnalyzer": {
                    "tokenizer": "standard",
                    "filter": ["asciifolding", "lowercase", "word_delimiter"]
                }
            },

            "filter": {
                "partialFilter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 50
                }
            }
        }
    },

    "mappings": {
        "village": {
            "properties": {
                "postalCode": {
                    "type": "string",
                    "index_analyzer": "partialPostalCodeAnalyzer",
                    "search_analyzer": "searchAnalyzer"
                },

                "name": {
                    "type": "string",
                    "index_analyzer": "partialNameAnalyzer",
                    "search_analyzer": "searchAnalyzer"
                },

                "population": {
                    "type": "integer",
                    "index": "not_analyzed"
                }
            }
        }
    }
}

Some sample documents:

PUT /tv_village/village/1
{"name": "Paris"}
PUT /tv_village/village/2
{"name": "Parigny"}
PUT /tv_village/village/3
{"name": "Fontenay-en-Parisis"}
PUT /tv_village/village/4
{"name": "Parigné-le-Pôlin"}

If I run this query, you can see that the results are not in the order I want (I want the 4th result to come before the 3rd one):

GET /tv_village/village/_search
{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  }
}

Results:

      "hits": [
         {
            "_index": "tv_village",
            "_type": "village",
            "_id": "1",
            "_score": 0.7768564,
            "_source": {
               "name": "Paris"
            }
         },
         {
            "_index": "tv_village",
            "_type": "village",
            "_id": "2",
            "_score": 0.7768564,
            "_source": {
               "name": "Parigny"
            }
         },
         {
            "_index": "tv_village",
            "_type": "village",
            "_id": "3",
            "_score": 0.3884282,
            "_source": {
               "name": "Fontenay-en-Parisis"
            }
         },
         {
            "_index": "tv_village",
            "_type": "village",
            "_id": "4",
            "_score": 0.3884282,
            "_source": {
               "name": "Parigné-le-Pôlin"
            }
         }
      ]

There is 1 answer

**Andrei Stefan** · Accepted Answer · 2014-11-19T09:01:55+00:00

In the analyzer section of your index settings, add another analyzer:

            "keywordLowercaseAnalyer": {
              "tokenizer": "keyword",
              "filter": ["lowercase"]
            }

This keeps the whole value intact as a single token (through the keyword tokenizer) and lowercases it (e.g. "parigné-le-pôlin"). Then define two more sub-fields for your name field:

one raw that should be not_analyzed

one raw_lowercase that should use keywordLowercaseAnalyer

"name": {
  "type": "string",
  "index_analyzer": "partialNameAnalyzer",
  "search_analyzer": "searchAnalyzer",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    },
    "raw_lowercase": {
      "type": "string",
      "analyzer": "keywordLowercaseAnalyer"
    }
  }
}

I'm doing this because you can have searches for "pari" or "Pari". In your query, use the rescore functionality to recompute the scoring based on an additional query:

{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  },
  "rescore": {
    "query": {
      "rescore_query": {
        "bool": {
          "should": [
            {"prefix": {"name.raw": "pari"}},
            {"prefix": {"name.raw_lowercase": "pari"}}
          ]
        }
      }
    }
  }
}

There are two drawbacks to the prefix query from the point of view of your use case:

it is quite resource-intensive
the value passed to a prefix query is not analyzed, which is the reason for adding those two raw* fields: one field holds a lowercase version, the other holds the untouched version, so that queries for "pari" or "Pari" are both covered.
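
To illustrate the second point, here is a minimal sketch, assuming the four sample documents from the question have been reindexed with the name.raw sub-field. Sent to /tv_village/village/_search, this prefix query should match Paris, Parigny and Parigné-le-Pôlin, because the value is compared as-is against the unanalyzed terms. The same query with a lowercase "pari" should match nothing on name.raw and would have to target name.raw_lowercase instead.

```json
{
  "query": {
    "prefix": {"name.raw": "Pari"}
  }
}
```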

I have two suggestions:

test the query above on your real data to see how it behaves performance-wise
play with the window_size attribute of the rescore query to limit the number of documents the rescoring is performed on, thus improving performance.
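
As a sketch of the second suggestion, window_size sits inside the rescore block, next to query. The value 50 below is an arbitrary assumption, not a recommendation; it limits how many of the top hits on each shard are passed to the rescore query and should be tuned against your real data.

```json
{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "bool": {
          "should": [
            {"prefix": {"name.raw": "pari"}},
            {"prefix": {"name.raw_lowercase": "pari"}}
          ]
        }
      }
    }
  }
}
```

Keep in mind that a window that is too small can leave a city that starts with the term outside the rescored set, in which case it never gets promoted.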

For your reference, this is the documentation page for rescore.
