Elasticsearch synonyms case sensitive results

Could anyone tell me what I'm doing wrong? I'm trying to set up Elasticsearch with the following index settings and mappings:

{
    "settings" : {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : ["en_US", "lowercase", "synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms_path" : "analysis/synonym.txt"
                    },
                    "en_US" : {
                        "type" : "hunspell",
                        "locale" : "en_US",
                        "dedup" : true
                    }
                }
            }
        }
    },
    "mappings" : {
        "jdbc" : {
            "properties" : {
                "Title" : {
                    "type" : "string",
                    "search_analyzer" : "synonym",
                    "index_analyzer" : "standard"
                },
                "Abstract" : {
                    "type" : "string",
                    "search_analyzer" : "synonym",
                    "index_analyzer" : "standard"
                }
            }
        }
    }
}

My synonym.txt file contains:

beer, ale, lager, cark ale
Lithuania, Republic of Lithuania, Lithuanian

So it's time to try out my analyzer:

http://localhost:9200/jdbc/_analyze?text=beer&analyzer=synonym&pretty=true

It works as expected and returns:

{
  "tokens" : [ {
    "token" : "beer",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "ale",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "lager",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "cark",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "ale",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  } ]
}

However, analyzing a capitalized term:

http://localhost:9200/jdbc/_analyze?text=Lithuanian&analyzer=synonym&pretty=true

returns only the lowercased input, with no synonyms:

{
  "tokens" : [ {
    "token" : "lithuanian",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  } ]
}

Any tips?

Answer by keety (best answer)

During analysis, the token filters are applied in the order specified. In the analyzer above, "Lithuanian" is lowercased to "lithuanian" before it reaches the synonym filter. Because the synonym file contains only the capitalized forms ("Lithuania", "Lithuanian"), the lowercased token matches nothing and no synonyms are added. The Elasticsearch guide covers this in more detail.
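The ordering effect can be reproduced with a toy model of a filter chain. This is only a sketch in Python, not Elasticsearch code: the names `SYNONYM_GROUPS`, `lowercase`, `synonym`, and `analyze` are made up for illustration, multi-word synonyms and the hunspell filter are left out, and matching is assumed to be exact and case sensitive.

```python
# Toy model of an analyzer's filter chain. Not Elasticsearch code:
# every name here is hypothetical and exists only for illustration.
SYNONYM_GROUPS = [
    ["beer", "ale", "lager"],
    ["Lithuania", "Lithuanian"],  # capitalized, as in the question's file
]


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def synonym(tokens):
    """Expand each token to its whole group using an exact, case-sensitive match."""
    out = []
    for t in tokens:
        group = next((g for g in SYNONYM_GROUPS if t in g), [t])
        out.extend(group)
    return out


def analyze(text, filters):
    """Split on whitespace, then apply the filters in the given order."""
    tokens = text.split()  # stands in for the whitespace tokenizer
    for f in filters:
        tokens = f(tokens)
    return tokens


# lowercase first: "lithuanian" is not in any group, so nothing expands
print(analyze("Lithuanian", [lowercase, synonym]))  # ['lithuanian']
# synonym first: the capitalized form matches, then everything is lowercased
print(analyze("Lithuanian", [synonym, lowercase]))  # ['lithuania', 'lithuanian']
```

In this model "beer" expands under either ordering because its group is already lowercase, which matches the behaviour reported in the question.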

There are two ways to go about this, depending on the use case. Either change the order of the token filters in the custom analyzer to:

    "synonym" : {
        "type" : "custom",
        "tokenizer" : "whitespace",
        "filter" : ["en_US", "synonym", "lowercase"]
    }

Matching would still be case sensitive, but it would be consistent with the capitalization used in synonym.txt above. A lowercase query such as "lithuanian" would still find no synonyms.

Or you could make all the synonyms in analysis/synonym.txt lowercase and keep the analyzer settings from the question unchanged. Example:

  beer, ale, lager, cark ale
  lithuania, republic of lithuania, lithuanian