Let's say there is an index/type named customers/customer. Each document in this set has a zip code as a property. Basically, a zip code can look like:

- String-String (e.g. 8907-1009)
- String String (e.g. 211-20)
- String (e.g. 30200)

I'd like to configure my index analyzer to match as many documents as possible. Currently, I work like this:
```json
PUT /customers/
{
  "mappings": {
    "customer": {
      "properties": {
        "zip-code": {
          "type": "string",
          "index": "not_analyzed"
        }
        ... some string properties ...
      }
    }
  }
}
```
When I search for a document, I use this request:
```json
GET /customers/customer/_search
{
  "query": {
    "prefix": {
      "zip-code": "211-20"
    }
  }
}
```
That works if you want to search for exact values. But for instance if the zip code is "200 30", then searching with "200-30" will not return any results. I'd like to instruct my index analyzer so that I don't have this problem. Can someone help me? Thanks.
P.S. If you want more information, please let me know ;)
As soon as you want to find variations you don't want to use `not_analyzed`. Let's try this with a different mapping:
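The exact mapping isn't shown in the answer; a minimal sketch of what it could look like, assuming the Elasticsearch 2.x `string` field type used in the question, is simply to drop `not_analyzed` so the default (standard) analyzer is applied:

```json
PUT /customers/
{
  "mappings": {
    "customer": {
      "properties": {
        "zip-code": {
          "type": "string",
          "analyzer": "standard"
        }
      }
    }
  }
}
```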
We're using the standard tokenizer; strings will be broken up at whitespaces and punctuation marks (including dashes) into tokens. You can see the actual tokens if you run the following query:
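A query along these lines (the `_analyze` API with a request body, available from Elasticsearch 2.0 on) shows the tokens; for "8907-1009" the standard tokenizer should produce the two tokens "8907" and "1009":

```json
GET /customers/_analyze
{
  "field": "zip-code",
  "text": "8907-1009"
}
```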
Add your examples:
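The original documents aren't shown; indexing the three example zip codes from the question could look like this (document IDs 1-3 are arbitrary):

```json
PUT /customers/customer/1
{ "zip-code": "8907-1009" }

PUT /customers/customer/2
{ "zip-code": "211-20" }

PUT /customers/customer/3
{ "zip-code": "30200" }
```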
Now the query seems to work fine:
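A sketch of such a query, assuming a `match` query instead of the original `prefix` query so the search term gets analyzed the same way as the indexed value:

```json
GET /customers/customer/_search
{
  "query": {
    "match": {
      "zip-code": "211-20"
    }
  }
}
```

Both the query and the document are tokenized into "211" and "20", so the dash versus space difference no longer matters.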
This will also work if you just search for "211". However, this might be too lenient, since it will also find "20", "20-211", "211-10",...
What you probably want is a phrase search where all the tokens in your query need to be in the field and also in the right order:
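A phrase search for this field could be sketched with `match_phrase`, which requires all query tokens to occur in the field in the same order:

```json
GET /customers/customer/_search
{
  "query": {
    "match_phrase": {
      "zip-code": "211-20"
    }
  }
}
```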
Addition:
If the ZIP codes have a hierarchical meaning (if you have "211-20" you want it to be found when searching for "211", but not when searching for "20"), you can use the `path_hierarchy` tokenizer. So change the mapping to this:
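The answer's mapping isn't shown; a sketch of one possibility, using a custom analyzer built on the `path_hierarchy` tokenizer with "-" as the delimiter (the names `zip_hierarchy` and `zip_tokenizer` are made up for this example):

```json
PUT /customers/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "zip_hierarchy": {
          "tokenizer": "zip_tokenizer"
        }
      },
      "tokenizer": {
        "zip_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  },
  "mappings": {
    "customer": {
      "properties": {
        "zip-code": {
          "type": "string",
          "analyzer": "zip_hierarchy"
        }
      }
    }
  }
}
```

With this tokenizer, "8907-1009" is indexed as the tokens "8907" and "8907-1009", so prefixes of the hierarchy match but suffixes don't.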
Using the same 3 documents from above you can now use the `match` query: "1009" won't find anything, but "8907" or "8907-1009" will.
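Assuming the hierarchical mapping above, such a `match` query could look like:

```json
GET /customers/customer/_search
{
  "query": {
    "match": {
      "zip-code": "8907-1009"
    }
  }
}
```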
If you want to also find "1009", but with a lower score, you'll have to analyze the zip code with both variations I have shown (combine the 2 versions of the mapping):
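One way to combine the two versions is a multi-field: the main field analyzed with the standard analyzer, plus a sub-field (named `hierarchical` here, an arbitrary choice) analyzed with the `path_hierarchy`-based analyzer sketched above:

```json
PUT /customers/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "zip_hierarchy": {
          "tokenizer": "zip_tokenizer"
        }
      },
      "tokenizer": {
        "zip_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  },
  "mappings": {
    "customer": {
      "properties": {
        "zip-code": {
          "type": "string",
          "analyzer": "standard",
          "fields": {
            "hierarchical": {
              "type": "string",
              "analyzer": "zip_hierarchy"
            }
          }
        }
      }
    }
  }
}
```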
Add a document with the inverse order to properly test it:
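Such a test document could be the "1009-111" zip code referenced below (the ID 4 is arbitrary):

```json
PUT /customers/customer/4
{ "zip-code": "1009-111" }
```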
Then search both fields, but boost the one with the hierarchical tokenizer by 3:
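A sketch of that search, assuming the multi-field mapping above with a `hierarchical` sub-field: a `bool`/`should` query over both fields, with a boost of 3 on the hierarchical one.

```json
GET /customers/customer/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "zip-code": "1009" } },
        {
          "match": {
            "zip-code.hierarchical": {
              "query": "1009",
              "boost": 3
            }
          }
        }
      ]
    }
  }
}
```

"1009-111" matches the boosted hierarchical field (its tokens are "1009" and "1009-111"), while "8907-1009" only matches the unboosted standard field.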
Then you can see that "1009-111" has a much higher score than "8907-1009".