Elasticsearch relevance: documents with similar names


I am looking for an approach to handle Elasticsearch relevance for document names like "bottle" and "bottle caps".

When someone searches for "bottle", "bottle caps" should score lower than "Red bottles".

Currently our search engine scores "red coloured bottle" as less relevant than "Bottle caps for 500ml bottle".

dshockley On BEST ANSWER

This is not something you can solve in Elasticsearch without adding more information. You want to rank "red bottles" over "bottle caps" because you know semantic information about these names: "red bottles" means the thing being described is a bottle, while "bottle caps" means it is something else (related to bottles, but not actually a bottle). If you want Elasticsearch's ranking to take this into account, you have to index that information (maybe add a keyword tag field, with the value "bottle" on one document and "bottle caps" on the other; you will have to experiment to see what works for your use case). Of course, this means that a person has to add tags for everything.
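As a rough illustration of the tag idea, here is a minimal Python sketch that builds the mapping properties and a search body as plain dictionaries. The field name `category`, the boost value of 2.0, and the helper `build_query` are all assumptions made for this example and are not part of the original setup. The sketch requires a match on `name` and uses an optional `term` clause so that documents whose tag equals the search term get a higher score.

```python
import json

# Hypothetical mapping properties: "category" is an assumed keyword field
# holding a hand-assigned tag such as "bottle" or "bottle caps".
mapping_properties = {
    "name": {"type": "text"},
    "category": {"type": "keyword"},
}

def build_query(term, boost=2.0):
    """Build a search body: match on name, boost exact matches on the tag."""
    return {
        "query": {
            "bool": {
                # Every hit must match the search term in its name.
                "must": {"match": {"name": term}},
                # Optional clause: adds to the score when the tag equals the term.
                "should": {"term": {"category": {"value": term, "boost": boost}}},
            }
        }
    }

# Print the body so it can be pasted into a search request.
print(json.dumps(build_query("bottle"), indent=2))
```

With a body like this, a document tagged "bottle" would match both clauses, while one tagged "bottle caps" would match only on `name`. The boost would need tuning against real data.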

However, I suspect you can improve the situation somewhat with the unique token filter. My guess is that you don't care much about term frequency within a single title ("Bottle caps for 500ml bottle" isn't more about bottles just because "bottle" appears in it twice; I think term frequency makes little sense for titles like this). So you could do something like this:

PUT /myindex
{
  "settings": {
    "index": {
      "number_of_shards": 1
    },
    "analysis": {
      "analyzer": {
        "uniq_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "porter_stem",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "uniq_analyzer"
        }
      }
    }
  }
}

PUT /myindex/doc/1
{"name": "Red coloured bottles"}

PUT /myindex/doc/2
{"name": "Bottle caps for 500ml bottle"}

Then if you search for bottle, you'll see the scores are identical -- not perfect, but an improvement. If you want to understand where a score is coming from, you can use explain:

POST /myindex/_search
{
  "explain": true,
  "query": {
    "match": {"name": "bottle"}
  }
}
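To see why the unique filter removes the term-frequency advantage, here is a small pure-Python sketch that imitates the analyzer chain (lowercase, stem, deduplicate). It does not call Elasticsearch. The function names are made up for this example, whitespace splitting stands in for the standard tokenizer, and `crude_stem` is only a stand-in that strips a trailing "s", not the real Porter stemmer.

```python
from collections import Counter

def crude_stem(token):
    # Stand-in for porter_stem (assumption): only strips a trailing "s".
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def analyze(text, unique=False):
    """Lowercase, split on whitespace, stem, and optionally drop duplicates."""
    tokens = [crude_stem(t) for t in text.lower().split()]
    if unique:
        # dict.fromkeys keeps the first occurrence of each token, in order.
        tokens = list(dict.fromkeys(tokens))
    return tokens

def term_frequency(text, term, unique=False):
    """Count how often the analyzed term occurs in the analyzed text."""
    return Counter(analyze(text, unique))[term]

for title in ["Red coloured bottles", "Bottle caps for 500ml bottle"]:
    print(title,
          term_frequency(title, "bottle"),
          term_frequency(title, "bottle", unique=True))
```

Without deduplication the second title counts "bottle" twice; with it, both titles count it once, so term frequency no longer favours the bottle caps document.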