Elasticsearch synonyms case sensitive results

Question

Asked by Evaldas Buinauskas on 18 November 2014 at 18:43 · 1.8k views

Could anyone give me a tip on what I'm doing wrong? I'm trying to set up Elasticsearch this way:

{
    "settings" : {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : ["en_US", "lowercase", "synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms_path" : "analysis/synonym.txt"
                    },
                    "en_US" : {
                        "type" : "hunspell",
                        "locale" : "en_US",
                        "dedup" : true
                    }
                }
            }
        }
    },
    "mappings" : {
        "jdbc" : {
            "properties" : {
                "Title" : {
                    "type" : "string",
                    "search_analyzer" : "synonym",
                    "index_analyzer" : "standard"
                },
                "Abstract" : {
                    "type" : "string",
                    "search_analyzer" : "synonym",
                    "index_analyzer" : "standard"
                }
            }
        }
    }
}

My synonym.txt file contains:

beer, ale, lager, cark ale
Lithuania, Republic of Lithuania, Lithuanian

So it's time to try out my analyzer:

http://localhost:9200/jdbc/_analyze?text=beer&analyzer=synonym&pretty=true

It works as expected and returns:

{
  "tokens" : [ {
    "token" : "beer",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "ale",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "lager",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "cark",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "ale",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  } ]
}

However, this request:

http://localhost:9200/jdbc/_analyze?text=Lithuanian&analyzer=synonym&pretty=true

returns only:

{
  "tokens" : [ {
    "token" : "lithuanian",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  } ]
}

Any tips?

There is 1 answer

**keety** · Accepted Answer · 2014-11-19T12:52:46+00:00

During analysis, token filters are applied in the order specified. In the analyzer above, "Lithuanian" is therefore converted to lowercase "lithuanian" before it reaches the synonym filter. Since the synonym file does not contain the lowercase form of the token, nothing matches and no synonym expansion occurs. The Elasticsearch guide covers this in more detail.
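The effect of filter order can be reproduced with a small toy model. This is only a sketch in plain Python, not Elasticsearch code: the function names `lowercase`, `synonym`, and `analyze` are made up for illustration, the hunspell filter is left out, and multi-word synonyms are kept as single strings rather than being split into separate tokens as Elasticsearch does.

```python
# Toy model of a token-filter chain; names here are hypothetical, not an Elasticsearch API.
SYNONYM_RULES = [
    ["beer", "ale", "lager", "cark ale"],
    ["Lithuania", "Republic of Lithuania", "Lithuanian"],
]


def lowercase(tokens):
    """Stand-in for the lowercase token filter."""
    return [t.lower() for t in tokens]


def synonym(tokens):
    """Expand a token only if it matches a rule entry exactly (case-sensitive)."""
    out = []
    for t in tokens:
        for rule in SYNONYM_RULES:
            if t in rule:
                out.extend(rule)
                break
        else:
            # No rule matched: pass the token through unchanged.
            out.append(t)
    return out


def analyze(text, filters):
    """Split on whitespace, then apply the filters in the given order."""
    tokens = text.split()
    for f in filters:
        tokens = f(tokens)
    return tokens


# Order from the question: lowercase runs first, so the capitalized rule never matches.
print(analyze("Lithuanian", [lowercase, synonym]))
# ['lithuanian']

# Synonym first, then lowercase: the rule matches and the expansion is lowercased afterwards.
print(analyze("Lithuanian", [synonym, lowercase]))
# ['lithuania', 'republic of lithuania', 'lithuanian']
```

The "beer" query works with either order in this model because its rule is already all lowercase.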

There are two ways to go about this, depending on the use case. Either change the order of the token filters in the custom analyzer to:

  "synonym" : {
      "type" : "custom",
      "tokenizer" : "whitespace",
      "filter" : ["en_US", "synonym", "lowercase"]
  }

This would still be case-sensitive (a lowercase query such as "lithuanian" would not match), but the matching would be consistent with the synonym.txt above.

Or you could make all the synonyms in analysis/synonym.txt lowercase and keep the analyzer settings as specified in the question. Example:

  beer, ale, lager, cark ale
  lithuania, republic of lithuania, lithuanian
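
For a longer synonym file, the lowercasing could be scripted instead of done by hand. A minimal sketch, assuming the file is plain text with one comma-separated rule per line as in the question; `lowercase_synonyms` is a hypothetical helper name, not part of Elasticsearch:

```python
def lowercase_synonyms(text):
    """Lowercase every line of a synonym file, keeping the line structure intact."""
    return "\n".join(line.lower() for line in text.splitlines())


original = (
    "beer, ale, lager, cark ale\n"
    "Lithuania, Republic of Lithuania, Lithuanian"
)

print(lowercase_synonyms(original))
```

The result would then be written back to analysis/synonym.txt; whether the index has to be closed and reopened (or recreated) for the change to take effect depends on the Elasticsearch version in use.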