Elasticsearch custom analyser migration from ES 1.x to ES 7.x


Requirement:

Search for documents that contain the given phrase, with the words in the same order as in the phrase.

This works fine in the setup that uses ES 1.x. I'm trying to upgrade to ES 7.x. After updating the queries to comply with ES 7.x, the search also returns documents that contain only part of the given phrase.

The query used against ES 1.x:

{
  "from": 0,
  "size": 500,
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "and": {
          "filters": [
            {
              "or": {
                "filters": [
                  {
                    "query": {
                      "match": {
                        "document": {
                          "query": "dream comes true",
                          "type": "phrase",
                          "operator": "AND"
                        }
                      }
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
  }
}

The query used in the migrated environment:

{
  "from": 0,
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match_phrase": {
                  "document": {
                    "query": "dream comes true"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
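For reference, the single-clause bool/must/should wrappers in the migrated query should not change which documents match, so the same request can be written as a bare match_phrase. Below is a minimal Python sketch that builds such a body with slop and analyzer set explicitly. The helper name build_phrase_query is hypothetical and not part of the original setup. slop and analyzer are standard match_phrase parameters, and slop already defaults to 0, so this only makes the intent visible and is not a confirmed fix.

```python
import json


def build_phrase_query(field, phrase, analyzer=None, slop=0):
    """Build a match_phrase request body (hypothetical helper for illustration)."""
    clause = {"query": phrase, "slop": slop}
    if analyzer is not None:
        # Pin the search-time analyzer instead of relying on the field mapping.
        clause["analyzer"] = analyzer
    return {"from": 0, "query": {"match_phrase": {field: clause}}}


# Field and analyzer names are taken from the question; the rest is assumed.
body = build_phrase_query("document", "dream comes true", analyzer="docs_analyzer")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request against the migrated index.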

The latter also returns documents that contain only part of the phrase. I have checked the analyser configurations.

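Both analysers (in 1.x and 7.x) use html_strip as the char_filter and the uax_url_email tokenizer. The other filters are the same, except that the 1.x setup has the standard filter, which is removed in 7.x, and word_delimiter is replaced by word_delimiter_graph in 7.x (see the update below).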

Even though the analyser configuration is the same, the results differ when analysing the phrase with http://xxxx:9200/xxxx/_analyze:

For 1.x:

{
    "tokens": [
        {
            "token": "analyzer",
            "start_offset": 7,
            "end_offset": 15,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "docs_analyzer",
            "start_offset": 19,
            "end_offset": 35,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "text",
            "start_offset": 43,
            "end_offset": 47,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "dreams",
            "start_offset": 51,
            "end_offset": 57,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "comes",
            "start_offset": 58,
            "end_offset": 63,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "true",
            "start_offset": 64,
            "end_offset": 68,
            "type": "<ALPHANUM>",
            "position": 6
        }
    ]
}

For 7.x:

{
    "tokens": [
        {
            "token": "dream",
            "start_offset": 0,
            "end_offset": 6,
            "type": "shingle",
            "position": 0
        },
        {
            "token": "come",
            "start_offset": 7,
            "end_offset": 12,
            "type": "shingle",
            "position": 1
        },
        {
            "token": "dream",
            "start_offset": 7,
            "end_offset": 7,
            "type": "shingle",
            "position": 1
        },
        {
            "token": "come",
            "start_offset": 7,
            "end_offset": 12,
            "type": "shingle",
            "position": 2
        },
        {
            "token": "true",
            "start_offset": 13,
            "end_offset": 17,
            "type": "shingle",
            "position": 3
        },
        {
            "token": "come",
            "start_offset": 13,
            "end_offset": 13,
            "type": "shingle",
            "position": 4
        },
        {
            "token": "true",
            "start_offset": 13,
            "end_offset": 17,
            "type": "shingle",
            "position": 5
        }
    ]
}
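To make the 7.x output easier to read, here is a small Python sketch that groups the tokens by position and checks whether a list of terms occurs at consecutive positions. It is a deliberately simplified model of a slop-0 phrase match, not how Elasticsearch actually evaluates the query. The helper names tokens_by_position and has_exact_phrase are hypothetical, and the token/position pairs are copied from the 7.x response above.

```python
from collections import defaultdict


def tokens_by_position(analyze_response):
    """Group token texts from an _analyze response by their position."""
    grouped = defaultdict(list)
    for tok in analyze_response.get("tokens", []):
        grouped[tok["position"]].append(tok["token"])
    return dict(sorted(grouped.items()))


def has_exact_phrase(positions, terms):
    """Simplified check: do the terms occur at consecutive positions?"""
    for start in positions:
        if all(term in positions.get(start + i, []) for i, term in enumerate(terms)):
            return True
    return False


# Token/position pairs copied from the 7.x _analyze output above.
es7_response = {"tokens": [
    {"token": "dream", "position": 0},
    {"token": "come", "position": 1},
    {"token": "dream", "position": 1},
    {"token": "come", "position": 2},
    {"token": "true", "position": 3},
    {"token": "come", "position": 4},
    {"token": "true", "position": 5},
]}

positions = tokens_by_position(es7_response)
print(positions)
print(has_exact_phrase(positions, ["dream", "come", "true"]))
```

Grouped this way, the 7.x analysis spreads three input words over six positions, with two tokens stacked at position 1.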

I'd appreciate any help resolving this issue.

Update

1.x:

Index mapping

"document": { 
    "type": "string",
    "boost": 8,
    "term_vector": "with_positions_offsets",
    "analyzer": "docs_analyzer",
    "similarity": "BM25"
}

Analyzer:

"docs_analyzer": {
  "type": "custom",
  "char_filter": "html_strip",
  "filter": [
    "standard",
    "asciifolding",
    "word_delimiter",
    "lowercase",
    "stop_filter",
    "kstem"
  ],
  "tokenizer": "uax_url_email"
}

7.x:

Index mapping

"document": {
  "type": "text",
  "boost": 8,
  "term_vector": "with_positions_offsets",
  "analyzer": "docs_analyzer",
  "similarity": "BM25"
}

Analyzer:

"docs_analyzer": {
  "filter": [
    "asciifolding",
    "word_delimiter_graph",
    "lowercase",
    "stop_filter",
    "kstem"
  ],
  "char_filter": "html_strip",
  "type": "custom",
  "tokenizer": "uax_url_email"
}