Could anyone tell me what I'm doing wrong? I'm trying to set up Elasticsearch this way:
{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "synonym" : {
            "type" : "custom",
            "tokenizer" : "whitespace",
            "filter" : ["en_US", "lowercase", "synonym"]
          }
        },
        "filter" : {
          "synonym" : {
            "type" : "synonym",
            "synonyms_path" : "analysis/synonym.txt"
          },
          "en_US" : {
            "type" : "hunspell",
            "locale" : "en_US",
            "dedup" : true
          }
        }
      }
    }
  },
  "mappings" : {
    "jdbc" : {
      "properties" : {
        "Title" : {
          "type" : "string",
          "search_analyzer" : "synonym",
          "index_analyzer" : "standard"
        },
        "Abstract" : {
          "type" : "string",
          "search_analyzer" : "synonym",
          "index_analyzer" : "standard"
        }
      }
    }
  }
}
My synonym.txt file contains:
beer, ale, lager, cark ale
Lithuania, Republic of Lithuania, Lithuanian
So it's time to try out my analyzer:
http://localhost:9200/jdbc/_analyze?text=beer&analyzer=synonym&pretty=true
It works as expected and returns:
{
  "tokens" : [ {
    "token" : "beer",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "ale",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "lager",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "cark",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  }, {
    "token" : "ale",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "SYNONYM",
    "position" : 1
  } ]
}
However, this request:
http://localhost:9200/jdbc/_analyze?text=Lithuanian&analyzer=synonym&pretty=true
returns only:
{
  "tokens" : [ {
    "token" : "lithuanian",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  } ]
}
Any tips?
During analysis, the token filters are applied in the order specified. So in the case above, "Lithuanian" is first converted to lowercase "lithuanian". Since the synonym file does not contain this lowercase form of the token, no synonym expansion occurs. The Elasticsearch guide discusses this in more detail.
There are two ways to go about this, depending on the use case. Either change the order of the token filters in the custom analyzer so that the synonym filter runs before the lowercase filter.
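A minimal sketch of that reordering, assuming the only change is swapping `lowercase` and `synonym` in the analyzer from the question. Keeping `en_US` first is an assumption here: the hunspell filter may alter a token before the synonym lookup sees it, so it is worth checking the result with `_analyze`.

```json
{
  "analyzer" : {
    "synonym" : {
      "type" : "custom",
      "tokenizer" : "whitespace",
      "filter" : ["en_US", "synonym", "lowercase"]
    }
  }
}
```

With this order the synonym lookup sees the token in its original case, so `Lithuanian` should match the rule as written in the file, and the expanded tokens are lowercased afterwards.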
That approach would still be case-sensitive, but the conversion would be consistent with the synonym.txt shown above.
Or you could make all the synonyms in analysis/synonym.txt lowercase and keep the same analyzer settings as specified in the question. Example:
beer, ale, lager, cark ale
lithuania, republic of lithuania, lithuanian