Elasticsearch edge n-gram not returning all expected results

I am having a hard time working out why an Elasticsearch query returns unexpected results. I indexed the following document into Elasticsearch.

{
"group": "J00-I99", "codes": [
   { "id": "J15", "description": "hello world" },
   { "id": "J15.0", "description": "test one world" },
   { "id": "J15.1", "description": "test two world J15.0" },
   { "id": "J15.2", "description": "test two three world J15" },
   { "id": "J15.3", "description": "hello world J18 " },
    ............................ // Similar records here
   { "id": "J15.9", "description": "hello world new" },
   { "id": "J16.0", "description": "new description" }
]
}

My aim is to implement autocomplete, and for that I used the edge n-gram approach. I don't want to use the completion suggester.

Currently I am stuck with two issues:

  1. Search query (both id and description fields): J15

Expected result: All of the above records that contain J15
Actual result: Only a few results (J15.0, J15.1, J15.8)

  2. Search query (both id and description fields): test two

Expected result:

{ "id": "J15.1", "description": "test two world J15.0" },
{ "id": "J15.2", "description": "test two three world J15" },

Actual Result:

   { "id": "J15.0", "description": "test one world" },
   { "id": "J15.1", "description": "test two world J15.0" },
   { "id": "J15.2", "description": "test two three world J15" },

The mapping is defined like this:

           {

                settings: {
                    number_of_shards: 1,
                    analysis: {
                        filter: {
                            ngram_filter: {
                                type: 'edge_ngram',
                                min_gram: 2,
                                max_gram: 20
                            }
                        },
                        analyzer: {
                            ngram_analyzer: {
                                type: 'custom',
                                tokenizer: 'standard',
                                filter: [
                                    'lowercase', 'ngram_filter'
                                ]
                            }
                        }
                    }
                },
                mappings: {
                    properties: {
                        group: {
                            type: 'text'
                        },
                        codes: {
                            type: 'nested',
                            properties: {
                                id: {
                                    type: 'text',
                                    analyzer: 'ngram_analyzer',
                                    search_analyzer: 'standard'
                                },
                                description: {
                                    type: 'text',
                                    analyzer: 'ngram_analyzer',
                                    search_analyzer: 'standard'
                                }
                            }
                        }
                    }
                }
            }

Search Query:

GET myindex/_search
{
  "_source": {
    "excludes": [
      "codes"
    ]
  },
  "query": {
    "nested": {
      "path": "codes",
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "codes.description": "J15"
              }
            },
            {
              "match": {
                "codes.id": "J15"
              }
            }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}

Note: The real index will be large; only sample data is shown here.

For the second issue, can I use multi_match with the AND operator, like below?

GET myindex/_search
{
  "_source": {
    "excludes": [
      "codes"
    ]
  },
  "query": {
    "nested": {
      "path": "codes",
      "query": {
        "bool": {
          "should": [
            {
              "multi_match": {
                    "query": "J15",
                    "fields": ["codes.id", "codes.description"],
                    "operator": "and"
                }
            }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}

Any help would be really appreciated, as I am having a hard time fixing this.

There are 3 answers

Amit (best answer)

Adding another answer, as this is a different issue and my first answer focused on the first issue.

The issue is that your second query, test two, also returns test one world. At index time you use ngram_analyzer, which is built on the standard tokenizer and splits the text into words, and your search analyzer is standard as well. A match query combines the analyzed search tokens with OR by default, so a nested document matches as soon as any one token matches. If you run the Analyze API on your indexed text and on your search term, you will see which tokens they share (the edge n-grams are left out here; the full token test is one of them):

{
   "text" : "test one world",
   "analyzer" : "standard"
}

And generated tokens

{
    "tokens": [
        {
            "token": "test",
            "start_offset": 0,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "one",
            "start_offset": 5,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "world",
            "start_offset": 9,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 2
        }
    ]
}

And for your search term test two

{
    "tokens": [
        {
            "token": "test",
            "start_offset": 0,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "two",
            "start_offset": 5,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

As you can see, the token test is present in your document, which is why you get that search result. It can be solved by using the AND operator in the query, as shown below.

Search query

{
    "_source": {
        "excludes": [
            "codes"
        ]
    },
    "query": {
        "nested": {
            "path": "codes",
            "query": {
                "bool": {
                    "must": {
                        "multi_match": {
                            "query": "test two",
                            "fields": [
                                "codes.id",
                                "codes.description"
                            ],
                            "operator" :"AND"
                        }
                    }
                }
            },
            "inner_hits": {}
        }
    }
}

And search results

 "hits": [
                                {
                                    "_index": "myindexedge64170045",
                                    "_type": "_doc",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "codes",
                                        "offset": 2
                                    },
                                    "_score": 2.6901608,
                                    "_source": {
                                        "id": "J15.1",
                                        "description": "test two world J15.0"
                                    }
                                },
                                {
                                    "_index": "myindexedge64170045",
                                    "_type": "_doc",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "codes",
                                        "offset": 3
                                    },
                                    "_score": 2.561376,
                                    "_source": {
                                        "id": "J15.2",
                                        "description": "test two three world J15"
                                    }
                                }
                            ]
                        }
                    }
                }
            }
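
To make the OR versus AND behaviour concrete, here is a minimal Python sketch that imitates the analysis chain. It is only an approximation: whitespace splitting stands in for the standard tokenizer, and the helper names (edge_ngrams, index_terms, search_terms, matches) are made up for illustration; they are not Elasticsearch APIs.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the prefixes of `token` with length in [min_gram, max_gram]."""
    return {token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)}


def index_terms(text):
    """Approximate ngram_analyzer: split on whitespace, lowercase, edge n-grams."""
    terms = set()
    for token in text.lower().split():
        terms |= edge_ngrams(token)
    return terms


def search_terms(text):
    """Approximate the standard search analyzer: split on whitespace, lowercase."""
    return text.lower().split()


def matches(doc_text, query, operator="or"):
    """Check the search tokens against the indexed terms with OR or AND semantics."""
    indexed = index_terms(doc_text)
    found = [term in indexed for term in search_terms(query)]
    return all(found) if operator == "and" else any(found)


descriptions = ["test one world", "test two world J15.0", "test two three world J15"]
or_hits = [d for d in descriptions if matches(d, "test two", "or")]
and_hits = [d for d in descriptions if matches(d, "test two", "and")]
print(or_hits)   # all three descriptions
print(and_hits)  # only the two descriptions that contain both "test" and "two"
```

With OR, the shared token test is enough to pull in test one world; with AND, every search token has to be found among the edge n-grams of the same field.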

ESCoder

Adding a working example with index mapping, search query, and search result

Index Mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 50
  },
  "mappings": {
    "properties": {
      "group": {
        "type": "text"
      },
      "codes": {
        "type": "nested",
        "properties": {
          "id": {
            "type": "text",
            "analyzer": "my_analyzer"
          }
        }
      }
    }
  }
}
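
For reference, a rough Python approximation of what the edge_ngram tokenizer in this mapping emits. Assumptions: the regex stands in for token_chars letter and digit, and edge_ngram_tokenizer is a hypothetical helper, not an Elasticsearch call. Because the mapping sets no separate search_analyzer and has no lowercase filter, the same tokenization is applied to the query text and case is preserved.

```python
import re


def edge_ngram_tokenizer(text, min_gram=2, max_gram=20):
    """Split on anything that is not a letter or digit, then emit edge n-grams per word."""
    tokens = []
    for word in re.findall(r"[^\W_]+", text):
        for n in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:n])
    return tokens


print(edge_ngram_tokenizer("J15.0"))
print(edge_ngram_tokenizer("test two"))
```

In this sketch the dot splits J15.0 into J15 and 0, and the single character 0 is shorter than min_gram, so it produces no token.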

Index Data:

{
    "group": "J00-I99", 
    "codes": [
        {
            "id": "J15",
            "description": "hello world"
        },
        {
            "id": "J15.0",
            "description": "test one world"
        },
        {
            "id": "J15.1",
            "description": "test two world J15.0"
        },
        {
            "id": "J15.2",
            "description": "test two three world J15"
        },
        {
            "id": "J15.3",
            "description": "hello world J18 "
        },
        {
            "id": "J15.9",
            "description": "hello world new"
        },
        {
            "id": "J16.0",
            "description": "new description"
        }
    ]
}

Search Query:

{
    "_source": {
        "excludes": [
            "codes"
        ]
    },
    "query": {
        "nested": {
            "path": "codes",
            "query": {
                "bool": {
                    "should": [
                        {
                            "match": {
                                "codes.description": "J15"
                            }
                        },
                        {
                            "match": {
                                "codes.id": "J15"
                            }
                        }
                    ],
                    "must": {
                        "multi_match": {
                            "query": "test two",
                            "fields": [
                                "codes.id",
                                "codes.description"
                            ],
                            "type": "phrase"
                        }
                    }
                }
            },
            "inner_hits": {}
        }
    }
}

Search Result:

"inner_hits": {
          "codes": {
            "hits": {
              "total": {
                "value": 2,
                "relation": "eq"
              },
              "max_score": 3.2227304,
              "hits": [
                {
                  "_index": "stof_64170045",
                  "_type": "_doc",
                  "_id": "1",
                  "_nested": {
                    "field": "codes",
                    "offset": 3
                  },
                  "_score": 3.2227304,
                  "_source": {
                    "id": "J15.2",
                    "description": "test two three world J15"
                  }
                },
                {
                  "_index": "stof_64170045",
                  "_type": "_doc",
                  "_id": "1",
                  "_nested": {
                    "field": "codes",
                    "offset": 2
                  },
                  "_score": 2.0622847,
                  "_source": {
                    "id": "J15.1",
                    "description": "test two world J15.0"
                  }
                }
              ]
            }
          }
        }
      }

Amit

The issue is that inner_hits returns only the top 3 matching hits by default, as mentioned in the official documentation:

size

The maximum number of hits to return per inner_hits. By default the top three matching hits are returned.

Simply add the size parameter to your inner_hits to get more of the matching nested hits.

  "inner_hits": {
                "size": 10 // note this
            }
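
If you build the request in code, a small sketch like the following keeps the inner_hits size explicit. The function name nested_codes_query is hypothetical, and the sketch only builds the JSON body; it does not assume any particular client library.

```python
import json


def nested_codes_query(text, inner_hits_size=10):
    """Build the nested query from the question with an explicit inner_hits size."""
    return {
        "_source": {"excludes": ["codes"]},
        "query": {
            "nested": {
                "path": "codes",
                "query": {
                    "bool": {
                        "should": [
                            {"match": {"codes.description": text}},
                            {"match": {"codes.id": text}},
                        ]
                    }
                },
                # Without "size", only the top 3 nested hits are returned.
                "inner_hits": {"size": inner_hits_size},
            }
        },
    }


body = nested_codes_query("J15", inner_hits_size=10)
print(json.dumps(body, indent=2))
```

As far as I know, the value is capped by the index.max_inner_result_window index setting (100 by default), so with very many nested codes you may also need to page through them with from.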

I tried this on your sample data; here is the search result for your first query, which was previously returning only 3 hits.

First query search result

   "hits": [
                                {
                                    "_index": "myindexedge64170045",
                                    "_type": "_doc",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "codes",
                                        "offset": 2
                                    },
                                    "_score": 1.8687118,
                                    "_source": {
                                        "id": "J15.1",
                                        "description": "test two world J15.0"
                                    }
                                },
                                {
                                    "_index": "myindexedge64170045",
                                    "_type": "_doc",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "codes",
                                        "offset": 3
                                    },
                                    "_score": 1.7934312,
                                    "_source": {
                                        "id": "J15.2",
                                        "description": "test two three world J15"
                                    }
                                },
                                {
                                    "_index": "myindexedge64170045",
                                    "_type": "_doc",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "codes",
                                        "offset": 0
                                    },
                                    "_score": 0.29618382,
                                    "_source": {
                                        "id": "J15",
                                        "description": "hello world"
                                    }
                                },
                                {
                                    "_index": "myindexedge64170045",
                                    "_type": "_doc",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "codes",
                                        "offset": 1
                                    },
                                    "_score": 0.29618382,
                                    "_source": {
                                        "id": "J15.0",
                                        "description": "test one world"
                                    }
                                },
                                {
                                    "_index": "myindexedge64170045",
                                    "_type": "_doc",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "codes",
                                        "offset": 4
                                    },
                                    "_score": 0.29618382,
                                    "_source": {
                                        "id": "J15.3",
                                        "description": "hello world J18 "
                                    }
                                },
                                {
                                    "_index": "myindexedge64170045",
                                    "_type": "_doc",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "codes",
                                        "offset": 5
                                    },
                                    "_score": 0.29618382,
                                    "_source": {
                                        "id": "J15.9",
                                        "description": "hello world new"
                                    }
                                }
                            ]
                        }
                    }
                }
            }